Preface

This is an ever-evolving set of lecture notes for Introduction to Stochastic Processes (M362M). It should start with me explaining what stochastic processes are. Instead, here is a list of several questions you will be able to give answers to when you complete this course.

Question 1 In a simplistic model, the price of a share of a stock goes either up or down by \(\$1\) each day, with probability \(1/2\). You own a single share whose value today is \(\$100\), so that its tomorrow’s price will be \(\$101\) or \(\$99\) with probability \(1/2\), etc. Your strategy is to hold onto your share until one of the following two things happen: you go bankrupt (the stock price hits \(0\)), or you make a \(\$50\) dollar profit (the stock price hits \(\$150\).)

  1. How likely is it that you will make a profit before you go bankrupt?
  2. How long will it take?
  3. Is it possible that it takes forever, i.e., that the stock price hovers between \(\$1\) and \(\$149\) forever?

Question 2. A person carrying a certain disease infects a random number of people in a week, and then stops being infectious. Each of the infected people transmits the disease in the same way, etc. Suppose that the number of people each (infectious) individual infects is either \(0\), \(1\) or \(2\) or \(3\), each with probability \(1/4\) and that different infectious individuals may infect different number of people and behave independently of each other.

  1. What is the probability that the disease will ever be eradicated?
  2. What is the probability that every single individual in the population of \(328,000,000\) will eventually be infected?

Question 3. In a game of tennis, Player \(1\) wins against Player \(2\) in each rally (the smallest chunk of the match that leads to point, i.e., to a score change from \(15-30\) to \(30-30\), for example) with probability \(p\). What is the probability that Player \(1\) wins

  1. a game (the chunk of the match that leads to a score change such as \(5-3\) to \(6-3\) within a set)?
  2. a set? the entire match?
  3. Is the game of tennis set up in such a way that is amplifies or reduces the difference in skill between players?

Question 4. A knight starts in the lower left corner of the chess board and starts moving “randomly”. That means that from any position, it chooses one of the possible (legal) moves and takes it, with all legal moves having the same probability. It keeps doing the same thing until it comes back to the square it started from.

  1. What is the expected number of moves the knight will make before it returns to “square one”?
  2. How about the same problem, but using a different chess piece? Which one do you think will come back is the smallest (expected) number of steps?
  3. (*) How about the same problem, but until all squares have been visited at least once?

Question 5. How does Google search work?

An intro to R and RStudio

Setting up an R environment on your computer

Installing R

Learning basic R is an important part of this course, and the first order of business is to download and install an R distribution on your personal computer. We will be using RStudio as an IDE (integrated development environment). Like R itself, it is free and readily available for all major platforms. To download R to your computer, go to https://cran.rstudio.com and download the version of R for your operating system (Windows, Mac or Linux). If you are on a Mac, you want the “Latest release” which, at the time of writing, is 4.3.2, with code name “Eye Holes”. On Windows, follow the link “install R for the first time”. We are not going to do any cutting edge stuff in this class, so an older release should be fine, too, if you happen to have it already installed on your system. Once you download the installation file (.pkg on a Mac or .exe on Windows), run it and follow instructions. If you are running Linux, you don’t need me to tell you what to do. Once it is successfully installed, don’t run the installed app. We will use RStudio for that.

Installing RStudio

To install RStudio, go to https://posit.co/download/rstudio-desktop/, and follow the instructions under “2: Install RStudio”. After you download and install it, you are ready to run it. When it opens, you will see something like this

The part on the left is called the console and that is (one of the places) where you enter commands. Before you do, it is important to adjust a few settings. Open the options window by navigating to to Tools->Global Options. In there, uncheck “Restore .RData into workspace on startup” and set “Save workspace to .RData on exit” to “Never”, as shown below:

This way, R will not pollute your environment with values you defined two weeks ago and completely forgot about. These settings are really an atavism and serve no purpose (for users like us) other than to introduce hard-to-track bugs.

There are many other settings you can play with in RStudio, but the two I mentioned above are the only ones that I really recommend setting as soon as you install it.

Installing basic packages

Finally, we need to install several R packages we will be using (mostly implicitly) during the class. First, run the following command in your console

install.packages( "tidyverse")

If R asks “Do you want to install from sources the packages which need compilation? (Yes/no/cancel)” answer no.

This will install a number of useful packages and should only take about a minute or two. The next part is a bit longer, and can take up to 15 minutes if you have a slow computer/internet connection. You only have to do it once, though. Skip both steps involving tinytex below if you have LaTeX already installed on your system1. Start with

install.packages("tinytex")

followed by

tinytex::install_tinytex()

Note that if you go to the top right corner of each of the code blocks (gray boxes) containing instructions above, an icon will appear. If you click on it, it will copy the content of the box into your clipboard, and you can simply paste it into RStudio. You can do that with any code block in these notes.

Learning the basics of R

Once R and RStudio are on your computer, it is time to get acquainted with the basics of R. This class is not about the finer points of R itself, and I will try to make your R experience as smooth as possible. After all, R is a tool that will help us explore and understand stochastic processes. Having said that, it is important to realize that R is a powerful programming language specifically created for statistical and probabilistic applications. Some knowledge of R is a valuable skill to have in today’s job market, and you should take this opportunity to learn it. The best way, of course, is by using it, but before you start, you need to know the very basics. Don’t worry, R is very user friendly and easy to get started in. In addition, it has been around for a long time (its predecessor S appeared in 1976) and is extremely well documented - google introduction to R or a similar phrase, and you will get lots of useful hits.

My plan is to give you a bare minimum in the next few paragraphs, and then to explain additional R concepts as we need them. This way, you will not be overwhelmed right from the start, and you will get a bit of a mathematical context as you learn more. Conversely, learning R commands will help with the math, too.

The console, Scripts and R Notebooks

There at least three different ways of inputting commands into R - through console, scripts and R-notebooks.

The console, as I already mentioned, is a window in RStudio where you can enter your R commands one by one. As a command is entered (and enter pressed) R will run it and display the result below. A typical console session looks like this If you define a variable in a command, it will be available in all the subsequent commands. This way of interacting with R is perfect for quick-and-dirty computations and, what is somewhat euphemistically called “prototyping”. In other words, this way you are using R as a calculator. There is another reason why you might be using the console. It is perfect for package installation and for help-related commands. If you type help('log'), the output will appear in the Help pane on the right. You can also see all the available variables in the Environment pane on the (top) right.

As your needs increase, you will need more complex (and longer) code to meet them. This is where scripts come in. They are text files (but have the extension .R) that hold R code. Scripts can run as a whole, and be saved for later. To create a new script, go to File->New File->R Script. That will split your RStudio window in two: The top part will become a script editor, and your console will shrink to occupy the bottom part. You can write you code in there, edit and update it, and then run the whole script by clicking on Source, or pressing the associated shortcut key.

Inspired by Python Jupyter notebooks, R notebooks are a creature somewhere between scripts and the console, but also have some features of their own. An R notebook is nothing other than a specially formatted text file which contains chunks of R code mixed with regular text. You can think of these chunks as mini scripts. What differentiates them from scripts is that chunks can be executed (evaluated) and the output becomes a part of the notebook: R notebooks are R’s implementation of literate programming. The idea is that documentation should be written at the same time as the program itself. As far as this course is concerned, R notebooks are just the right medium for homework and exam submission. You can run code and provide the interpretation of its output in a single document. See here for more information.

Each chapter in these lecture notes is an R notebook!

Asking for help

The most important thing about learning R (and many other things, for that matter) is knowing whom (and how) to ask for help. Luckily, R is a well established language, and you can get a lot of information by simply googling your problem. For example, if you google logarithm in R the top hit (at the time of writing) gives a nice overview and some examples.

Another way to get information about a command or a concept in R is to use the command help. For example, if you input help("log") or ?log in your console, the right hand of your screen will display information on the function log and some of its cousins. Almost every help entry has examples at the bottom, and that is where I always go first.

Vectors

Objects we will be manipulating in this class are almost exclusively vectors and matrices. The simplest vectors are those that have a single component, in other words, numbers. In R, you can assign a number to a variable using two different notations. Both

a <- 1

and

a = 1

will assign the value \(1\) to the variable a. If you want to create a longer vector, you can use the concatenation operator c as follows:

x = c(1, 2, 3, 4)

Once you evaluate the above in your console, the value of x is stored and you can access it by using the command print

print(x)
## [1] 1 2 3 4

or simply evaluating x itself:

x
## [1] 1 2 3 4

Unlike all code blocks above them, the last two contain both input and output. It is standard not to mark the output by any symbol (like the usual >), and to mark the output by ## which otherwise marks comments. This way, you can copy any code block from these notes and paste it into the console (or your script) without having to modify it in any way. Try it!

We built the vector x above by concatenating four numbers (vectors of length 1). You can concatenate vectors of different sizes, too:

a = c(1, 2, 3)
b = c(4, 5, 6)
(x = c(a, b, 7))
## [1] 1 2 3 4 5 6 7

You may be wondering why I put x = c(a,b,7) in parentheses. Without them, x would still become (1,2,3,4,5,6,7), but its value would not be printed out. A statement in parentheses is not only evaluated, but its result is also printed out. This way, (x = 2+3) is equivalent to x = 2+3 followed by x or print(x).

Vectors can contain things other than numbers. Strings, for example:

(x = c("Picard", "Data", "Geordi"))
## [1] "Picard" "Data"   "Geordi"

If you need a vector consisting of consecutive numbers, use the colon : notation:

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10

For sequences of equally spaced numbers, use the command seq (check its help for details)

seq(from = 5, to = 20, by = 3)
## [1]  5  8 11 14 17 20

An important feature or R is that many of its functions are vectorized. That means that if you give such a function a vector as an argument, the returned value will be a vector of results of that operation performed element by element. For example

x = c(10, 20, 30)
y = c(2, 4, 5)
x + y
## [1] 12 24 35
x * y
## [1]  20  80 150
x^2
## [1] 100 400 900
cos(x)
## [1] -0.8390715  0.4080821  0.1542514

The vectors do not need to be of the same size. R uses the recycling rule - it recycles the values of the shorter one, starting from the beginning, until its size matches the longer one:

x = c(10, 20, 30, 40, 50, 60)
y = c(1, 3)
x + y
## [1] 11 23 31 43 51 63

The case where the shorter vector is of length 1 is particularly useful:

x = c(10, 20, 30, 40)
x + 1
## [1] 11 21 31 41
x * (-2)
## [1] -20 -40 -60 -80

Extracting parts of the vector is accomplished by using the indexing operator []. Here are some examples (what do negative numbers do?)

x = c(10, 20, 30, 40, 50)
x[1]
## [1] 10
x[c(1, 2)]
## [1] 10 20
x[-1]
## [1] 20 30 40 50
x[-c(3, 4)]
## [1] 10 20 50
x[1:4]
## [1] 10 20 30 40
x[c(1, 1, 2, 2, 5, 4)]
## [1] 10 10 20 20 50 40

People familiar with Python should be aware of the following two differences: 1. indexing starts at 1 and not 0, and 2. negative indexing removes components; it does not start counting from the end!

It is important to note that the thing you put inside [] needs to be a vector itself. The above examples all dealt with numerical indices, but you can use logical indices, too. A variable is said to be logical or Boolean if it can take only one of the two values TRUE or FALSE. A vector whose components are all logical, are called, of course, logical vectors. You can think of logical indexing as the operation where you go through your original vector, and choose which components you want to keep (TRUE) and which you want the throw away (FALSE). For example

x = c(10, 20, 30, 40, 50)
y = c(TRUE, FALSE, FALSE, TRUE, TRUE)
x[y]
## [1] 10 40 50

This is especially useful when used together with the comparison operators. The expressions like x < y or x == y are operators2 in R, just like x + y or x / y. The difference is that < and == return logical values. For example

1 == 2
## [1] FALSE
3 > 4
## [1] FALSE
3 >= 2
## [1] TRUE

These operators are vectorized, so you can do things like this

x = c(1, 2, 3, 4, 5)
y = c(1, 3, 3, 2, 5)
x == y
## [1]  TRUE FALSE  TRUE FALSE  TRUE

or, using recycling,

x = c(1, 2, 3, 4, 5)
x > 3
## [1] FALSE FALSE FALSE  TRUE  TRUE

Let’s combine that with indexing. Suppose that we want to keep only the values greater than 4 in the vector x. The vector y = ( x > 4 ) is going to be of the same length as x and contain logical values. When we index x using it, only the values of x on positions where x > 4 will survive, and these are exactly the values we needed:

x = c(3, 2, 5, 3, 1, 5, 6, 4)
y = (x > 4)
x[y]
## [1] 5 5 6

or, simply,

x[x > 4]
## [1] 5 5 6

Indexing can be used to set the values of a vector just as easily

x = c(10, 20, 30, 40, 50)
x[2:4] = c(0, 1, 2)
x
## [1] 10  0  1  2 50

Recycling rules apply in the same way as above

x = c(10, 20, 30, 40, 50)
x[c(1, 2, 5)] = 7
x
## [1]  7  7 30 40  7

Matrices

A matrix in R can be created using the command matrix. The unusual part is that the input is a vector and R populates the components of the matrix by filling it in column by column or row by row. As always, an example will make this clear

x = c(1, 2, 3, 4, 5, 6)
(A = matrix(x, nrow = 2, ncol = 3, byrow = TRUE))
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

The first argument of the function matrix is the vector which contains all the values. If you want a matrix with m rows and n columns, this vector should be of size \(m n\). The arguments ncol and nrow are self-explanatory, and byrow is a logical argument which signals whether to fill by columns or by rows. Here is what happens when we set byrow = FALSE

x = c(1, 2, 3, 4, 5, 6)
(A = matrix(x, nrow = 2, ncol = 3, byrow = FALSE))
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

Accessing components of a matrix is as intuitive as it gets

(A = matrix(c(1, -1, 7, 2), nrow = 2, ncol = 2))
##      [,1] [,2]
## [1,]    1    7
## [2,]   -1    2
A[1, 2]
## [1] 7

Note that I did not use the argument byrow at all. In such cases, R always uses the default value (documented in the function’s help). For matrix the default value of byrow is FALSE, i.e., it fills the matrix column by column. This is not what we usually want because we tend to think of matrices as composed of rows. Moral: do not forget byrow = TRUE if that is what you, indeed, want.

Usual matrix operations can be performed in R in the obvious way

(A = matrix(c(1, -1, 7, 2), nrow = 2, ncol = 2))
##      [,1] [,2]
## [1,]    1    7
## [2,]   -1    2
(B = matrix(c(2, 2, -3, -4), nrow = 2, ncol = 2))
##      [,1] [,2]
## [1,]    2   -3
## [2,]    2   -4
A + B
##      [,1] [,2]
## [1,]    3    4
## [2,]    1   -2

You should be careful with matrix multiplication. The naive operator * yields a matrix, but probably not the one you want (what does * do?)

(A = matrix(c(1, 2, 0, 1), nrow = 2, ncol = 2))
##      [,1] [,2]
## [1,]    1    0
## [2,]    2    1
(B = matrix(c(3, 5, 1, 0), nrow = 2, ncol = 2))
##      [,1] [,2]
## [1,]    3    1
## [2,]    5    0
A * B
##      [,1] [,2]
## [1,]    3    0
## [2,]   10    0

If you want the matrix product, you have to use %*%

A %*% B
##      [,1] [,2]
## [1,]    3    1
## [2,]   11    2

Functions

The following syntax is used to define functions in R:

my_function = function(x, y, z) {
    return(x + y + z)
}

The function my_function returns the sum of its arguments. Having defined it, as above, we can use it like this

my_function(1, 3, 9)
## [1] 13

Neither the output nor the arguments of a function in R are restricted to numbers. Our next example function, named winners, takes two vectors as arguments and returns a vector. Its components are those components of the first input vector (x) that are larger than the corresponding components of the second input vector (y)

winners = function(x, y) {
    z = x > y
    return(x[z])
}
winners(c(1, 4, 5, 6, 2), c(2, 3, 3, 9, 2))
## [1] 4 5

Note how we used several things we learned above in this function. First, we defined the logical vector which indicates where x is larger than y. Then, we used logical indexing to return only certain components of x.

If-else statements

Our final element of R is its if-else statement. The syntax of the if statement is

if (condition) {
    statement
}

where condition is anything that has a logical value, and statement is any R statement. First R evaluates condition. If it is true, it runs statement. If it is false, nothing happens. If you want something to happen if (and only if) your condition is false, you need an if-else statement:

if (condition) {
    statement1
} else {
    statement2
}

This way, statement1 is evaluated when condition is true and statement1 when it is false. Since conditions inside the if statement return logical values, we can combine them using ands, ors or nots. The R notation for these operations is &, | and ! respectively, and to remind you what they do, here is a simple table

x y x & y (and) x | y (or) !x (not)
TRUE TRUE TRUE TRUE FALSE
TRUE FALSE FALSE TRUE FALSE
FALSE TRUE FALSE TRUE TRUE
FALSE FALSE FALSE FALSE TRUE

Let’s put what we learned about functions and if-else statements together to write a function distance_or_zero whose arguments are coordinates x and y of a point in the plane, and whose output is the distance from the point (x,y) to the origin if this distance happens to be between 1 and 2, and and 0 otherwise. We will use similar functions later when we discuss Monte Carlo methods:

distance_or_zero = function(x, y) {
    distance = sqrt(x^2 + y^2)
    if (distance <= 2 & distance >= 1) {
        return(distance)
    } else {
        return(0)
    }
}
distance_or_zero(1.2, 1.6)
## [1] 2
distance_or_zero(2, 3)
## [1] 0

Additional Problems for Chapter 1

Here are several simple problems. Their goal is to give you an idea of exactly how much R is required to get started in this course.

Compute the following (your answer should be a decimal number):

  1. \(1/238746238746\)
  2. \(2^{45}\)
  3. \(3^{28}\)
  4. \(\sqrt{15}\)
  5. \(\cos(\pi/8)\)
  6. \(e^2\)
  7. \(\log(2)\) (the base is \(e\))
  8. \(\log_{10}(2)\) (the base is \(10\))
  9. \(\sqrt[3]{ \frac{1342.16-2.18}{(3 \pi + 4.12)^2}}\)

Note: some of the answers will look like this 3.14e+13. If you do not know what that means, google scientific notation or E notation.

 

  1. Define two variables \(a\) and \(b\) with values \(3\) and \(4\) and “put” their product into a variable called \(c\). Output the value of \(c\).

  2. Define two vectors \(x\) and \(y\) of length \(3\), such that the components of \(x\) are \(1,2,3\) and the components of \(y\) are \(8,9,0\). Ouput their (componentwise) sum.

  3. Define a \(2\times 2\) matrix \(A=\begin{pmatrix} 1 & 2 \\ -1 & 3 \end{pmatrix}\).

  4. Compute the matrix square \(A^2\).

 

  1. Construct a vector \(x\) which contains all numbers from \(1\) to \(100\).

  2. Construct a vector \(y\) which contains squares of all numbers between \(20\) and \(2000\).

  3. Construct a vector \(z\) which contains only those components of \(y\) whose values are between \(400,000\) and \(500,000\).

  4. Compute the average (arithmetic mean) of all the components of \(z\). There is an R function that does that for you - find it!

 

  1. Write a function that takes a numerical argument \(x\) and returns \(5\) if \(x\geq 5\) and \(x\) itself otherwise.

  2. Write a function that returns TRUE (a logical value) if its argument is between \(2\) and \(3\) and FALSE otherwise.

  3. (Extra credit) Write a function that takes two equal-sized vectors as arguments and returns the angle between them in degrees. For definiteness, the angle between two vectors is defined to be \(0\) when either one of them is \((0,0,\dots,0)\).

Endnotes

Simulation of Random Variables and Monte Carlo

In the spirit of “learn by doing”, these lecture notes contain many “Problems”, both within the sections, and at the very end of each chapter. Those within sections come with solutions and usually introduce new concepts. They often feature a Comments section right after the solution subdivided into R and Math comments focusing on the computational or conceptual features, respectively. Note that you are not expected to be able to do the problems within sections before reading their solutions and comments, so don’t worry if you cannot. It is a good practice to try, though. Problems at the end, in the Additional Problems section are (initially) left unsolved. They do not require any new ideas and are there to help you practice the skills presented before.

Simulation of some common probability distributions

… where we also review some probability along the way.

“Draw” 50 simulations from the geometric distribution with parameter \(p=0.4\).

 

rgeom(50, prob = 0.4)
##  [1] 1 0 3 4 1 2 0 0 2 2 0 1 5 0 1 0 2 1 1 0 2 2 2 1 0 0 1 3 2 2 1 1 1 3 5 0 1 1
## [39] 0 0 0 1 2 0 1 1 1 0 1 0

Comments (R): R makes it very easy to simulate draws from a large class of named distributions3, such as geometric, binomial, uniform, normal, etc. For a list of all available distributions, run help("distributions") Each available distribution has an R name; the uniform is unif the normal is norm and the binomial is binom, etc. If you want to simulate \(n\) draws (aka a sample of size \(n\)) from a distribution, you form a full command by appending the letter r to its R name and use \(n\) as an argument. That is how we arrived to rgeom(50) in the solution above. The additional arguments of the function rgeom have to do with the parameters of that distribution. Which parameters go with which distributions, and how to input them as arguments to rgeom or rnorm is best looked up in R’s extensive documentation. Try help("rnorm"), for example.

Comments (Math): You could spend your whole life trying to understand what it really means to “simulate” or “generate” a random number. The numbers you obtain from so-called random number generators (RNG) are never random. In fact, they are completely deterministically generated. Still, sequences of numbers obtained from (good) random number generators share so many properties with sequences of mythical truly random numbers, that we can use them as if they were truly random. For the purposes of this class, you can assume that the numbers R gives you as random are random enough. Random number generation is a fascinating topic at the intersection of number theory, probability, statistics, computer science and even philosophy, but we do not have the time to cover any of it in this class. If you want to read a story about a particularly bad random number generator, go here.

You might have encountered a geometric distribution before. A random variable with that distribution can take any positive integer value or \(0\), i.e., its support is \({\mathbb N}_0=\{0,1,2,3,\dots\}\). As you can see from the output above, the value \(0\) appears more often than the value \(3\), and the value \(23\) does not appear at all in this particular simulation run. The probability of seeing the value \(k\in \{0,1,2,3,\dots\}\) as a result of a single draw is given by \((1-p)^k p\), where \(p\) is called the parameter of the distribution.

That corresponds to the following interpretation of the geometric distribution: keep tossing a biased coin (with probability p of obtaining H) until you see the first H; the number Ts before that is that value your geometric random variable4 If we put these probabilities in a single table (and choose \(p=0.4\), for example) it is going to look like this:
0 1 2 3 4 5 6 7
Prob. 0.4 0.24 0.144 0.086 0.052 0.031 0.019 0.011

Of course, the possible values our random variable can take do not stop at \(7\). In fact, there are infinitely many possible values, but we do not have infinite space. Note that even though the value \(23\) does not appear in the output of the command rgeom above, it probably would if we simulated many more than \(50\) values. Let’s try it with \(500\) draws - the table below counts how many \(0s\), \(1s\), \(2s\), etc. we got:

0 1 2 3 4 5 6 7 8 9 10
208 132 62 43 23 16 8 3 2 1 2

Still no luck, but we do observe values above 5 more often. By trial and error, we arrive at about \(1,000,000\) as the required number of simulations:

0 1 2 3 23 24 25 26
400616 238946 144274 86489 3 3 3 3

Compute the probability that among \(1,000,000\) draws of a geometric random variable with parameter \(p=0.4\), we never see a number greater than \(22\).

First, we compute the probability that the value seen in a single draw does not exceed \(22\):

pgeom(22, prob = 0.4)
## [1] 0.9999921

Different draws are independent of each other, so we need to raise this to the power \(1,000,000\).

(pgeom(22, prob = 0.4))^(1000000)
## [1] 0.0003717335

Comments (R): The command we used here is pgeom which is a cousin of rgeom. In general, R commands that involve named probability distributions consist of two parts. The prefix, i.e., the initial letter (p in this case) stands for the operation you want to perform, and the rest is the R name of the distribution. There are 4 prefixes, and the commands they produce are

Prefix Description
r Simulate random draws from the distribution.
p Compute the cumulative probability distribution function (cdf) (NOT pdf)
d Compute the probability density (pdf) or the probability mass function (pmf)
q Compute the quantile function

(see the Math section below for the reminder of what these things are). In this problem, we are dealing with a geometric random variable \(X\), which has a discrete distribution with support \(0,1,2,3,\dots\). Therefore, the R name is geom. We are interested in the probability \({\mathbb{P}}[ X\leq 22]\), which corresponds to the cdf of \(X\) at \(x=22\), so we use the the prefix p. Finally, we used the named parameter p and gave it the value p = 0.4, because the geometric distribution has a single parameter \(p\).

This problem also gives us a chance to discuss precision. As you can see, the probability of a single draw not exceeding \(22\) is very close to \(1\). In fact, it is equal to it to 5 decimal places. By default, R displays 7 significant digits of a number. That is enough for most applications (and barely enough for this one), but sometimes we need more. For example, let’s try to compute the probability of seeing no T (tails) in 10 tosses of a biased coin, where the probability of H (heads) is 0.9.

1 - 0.1^10
## [1] 1

While very close to it, this probability is clearly not equal to \(1\), as suggested by the output above. The culprit is the default precision. We can increase the precision (up to \(22\) digits) using the options command

options(digits = 17)
1 - 0.1^10
## [1] 0.99999999989999999

Precision issues like this one should not appear in this course, but they will out there “in the wild”, so it might be a good idea to be aware of them.

Comments (Math): If you forgot all about pdfs, cdfs and such things here is a little reminder:

cdf \(F(x) = {\mathbb{P}}[X\leq x]\)
pdf \(f(x)\) such that \({\mathbb{P}}[X \in [a,b]] = \int_a^b f(x) \, dx\) for all \(a<b\)
pmf \(p(x)\) such that \({\mathbb{P}}[X=a_n] = p(a_n)\) for some sequence \(a_n\)
qf \(q(p)\) is a number such that \({\mathbb{P}}[ X \leq q(p)] = p\)

Those random variables that admit a pdf are called continuous. The prime examples are the normal, or the exponential distribution. The ones where a pmf exists are called discrete. The sequence \(a_n\) covers all values that such a, discrete, random variable can take. Most often, \(a_n\) either covers the set of all natural numbers \(0,1,2,\dots\) or a finite subset such as \(1,2,3,4,5,6\).

Coming back to our original problem, we note that the probability we obtained is quite small. Since \(1/0.000372\) is about \(2690\), we would have to run about \(2690\) rounds of \(1,000,000\) simulations before the largest number falls below \(23\).

Compute the \(0.05\), \(0.1\), \(0.4\), \(0.6\) and \(0.95\) quantiles of the normal distribution with mean \(1\) and standard deviation \(2\).

 

qnorm(c(0.05, 0.1, 0.4, 0.6, 0.95), mean = 1, sd = 2)
## [1] -2.2897073 -1.5631031  0.4933058  1.5066942  4.2897073

Comments (R): The function we used is qnorm, with the prefix q which computes the quantile function and the R name norm because we are looking for the quantiles of the normal distribution. The additional (named) parameters are where the parameters of the distribution come in (the mean and the standard variation) in this case. Note how we plugged in the entire vector c(0.05, 0.1, 0.4, 0.6, 0.98) instead of a single value into qnorm. You can do that because this function is vectorized. That means that if you give it a vector as an argument, it will “apply itself” to each component of the vector separately, and return the vector of results. Many (but not all) functions in R are vectorized5.

As a sanity check, let’s apply pnrom (which computes the cdf of the normal) to these quantile values:

p = qnorm(c(0.05, 0.1, 0.4, 0.6, 0.95), mean = 1, sd = 2)
pnorm(p, mean = 1, sd = 2)
## [1] 0.05 0.10 0.40 0.60 0.95

As expected, we got the original values back - the normal quantile function and its cdf are inverses of each other.

Comments (Math): Computing the cdf of a standard normal is the same thing reading a normal table. Computing a quantile is the opposite; you go into the middle of the table and find your value, and then figure out which “Z” would give you that value.

Simulate \(60\) throws of a fair \(10\)-sided die.

 

sample(1:10, 60, replace = TRUE)
##  [1]  2  8  9  8  4  7  7  7  2  3  3 10  6  1  9  7  4  7  6  2  2  3 10  1  9
## [26]  7  3  2  8  4  1  2  8  1  4  9  1  9 10 10  6  1  8  6  1 10  5  1  6  9
## [51]  8  3  8  9  4  6  1  6  7  8

Comments (Math): Let \(X\) denote the outcome of a single throw of a fair \(10\)-sided die. The distribution of \(X\) is discrete (it can only take the values \(1,2,\dots, 10\)) but it is not one of the more famous named distributions. I guess you could call it a discrete uniform on \({1,2,\dots, 10}\), but a better way to describe such distribution is by a distribution table, which is really just a list of possible values a random variable can take, together with their, respective, probabilities. In this case,

1 2 3 4 5 6 7 8 9 10
0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1

Comments (R): The command used to draw a sample from a (finite) collection is, of, course sample. The first argument is a vector, and it plays the role of the “bag” from which you are drawing. If we are interested in repeated, random samples, we also need to specify replace = FALSE otherwise, you could draw any single number at most once:

sample(1:10, 8, replace = FALSE)
## [1]  1  5  6  7  8 10  3  4

With more than 10 draws, we would run out of numbers to draw:

sample(1:10, 12, replace = FALSE)
## Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'

The bag you draw from can contain objects other than numbers:

sample(c("Picard", "Data", "Geordi"), 9, replace = TRUE)
## [1] "Picard" "Data"   "Geordi" "Geordi" "Data"   "Data"   "Picard" "Data"  
## [9] "Geordi"
So far, each object in the bag had the same probability of being drawn. You can use the sample command to produce a weighted sample, too. For example, if we wanted to simulate \(10\) draws from the following distribution
1 2 3
0.2 0.7 0.1

we would use the additional argument prob:

sample(c(1, 2, 3), 10, replace = TRUE, prob = c(0.2, 0.7, 0.1))
##  [1] 1 2 2 1 1 2 2 3 2 2

Note how it is mostly \(2\)s, as expected.

Draw a sample of size \(n=10\) from \(N(1,2)\), i.e., from the normal distribution with parameters \(\mu=1\), \(\sigma = 2\). Plot a histogram of the obtained values. Repeat for \(n=100\) and \(n=100000\).

 

x = rnorm(10, mean = 1, sd = 2)
hist(x)

x = rnorm(100, mean = 1, sd = 2)
hist(x)

x = rnorm(100000, mean = 1, sd = 2)
hist(x)

Comments (R): It cannot be simpler! You use the command hist, feed it a vector of values, and it produces a histogram. It will even label the axes for you. If you want to learn how to tweak various features of your histogram, type ?hist.

Esthetically, the built-in histograms leave something to be desired. We can do better, using the package ggplot2. You don’t have to use it in this class, but if you want to, you install it first by running install.packages("ggplot2") (you have to do this only once). Then, every time you want to use it, you run library(ggplot2) to notify R that you are about to use a function from that package. It would take a whole semester to learn everything there is to know about ggplot2; I will only show what a histogram looks like in it:

library(ggplot2)
z = rnorm(100000, mean = 1, sd = 2)
ggplot(data=as.data.frame(z), aes(x=z))+
   geom_histogram(bins=50, fill="white", color="DarkRed")

Comments (Math):. Mathematically, histogram can be produced for any (finite) sequence of numbers: we divide the range into several bins, count how many of the points in the sequence falls into each bin, and then draw a bar above that bin whose height is equal (or proportional to) that count. The picture tells use about how the sequence we started from is “distributed”. The order of the points does not matter - you would get exactly the same picture if you sorted the points first. If the sequence of points you draw the histogram of comes from, say, normal distribution, the histogram will resemble the shape of the pdf of a normal distribution. I say resemble, because its shape is ultimately random. If the number of points is small (like in the second part of this problem) the histogram may look nothing like the normal pdf. However, when the number of points gets larger and larger, the shape of the histogram gets closer and closer to the underlying pdf (if it exists). I keep writing “shape” because the three histograms above have very different scales on the \(y\) axis. That is because we used counts to set the vertical sizes of bins. A more natural choice is to use the proportions, i.e. relative frequencies (i.e. counts divided by the total number of points) for bar heights. More precisely, the bar height \(h\) over the bin \([a,b]\) is chosen so that the area of the bar, i.e., \((b-a)\times h\) equals to the proportion of all points that fall inside \([a,b]\). This way, the total area under the histogram is always \(1\). To draw such a density histogram in R we would need to add the additional option freq = FALSE to hist:

x = rnorm(100000, mean = 1, sd = 2)
hist(x, freq = FALSE)

Note how the \(y\)-axes label changed from “Frequency” to “Density”. With such a normalization, the histogram of \(x\) can be directly compared to the probability density of a normal distribution. Here is a histogram of \(100,000\) simulations from our normal distribution with its density function (pdf) superimposed:

sims = rnorm(10000, mean = 1, sd = 2)
x = seq(-6, 8, by = 0.02)
y = dnorm(x, mean = 1, sd = 2)
hist(sims, freq = FALSE, main = "")
points(x, y, type = "l", lwd = 3, col = "red")

Multivariate Distributions

Let x contain \(2,000\) draws from \(N(0,1)\), z another \(2,000\) draws from \(N(0,1)\) and let y=x^2+z.

  1. Draw a scatterplot of x and y to visualize the joint distribution of x and y

  2. Plot two histograms, one of x and one of y. Do they tell the whole story about the joint distribution of x and y?

  3. Are x and y correlated? Do x and y in your plot “look independent”? Use the permutation test to check of independence between x and y.

 

x = rnorm(2000)
z = rnorm(2000)
y = x^2 + z
plot(x, y)

hist(x)
hist(y)

No, the two histograms would not be enough to describe the joint distribution. There are many ways in which two random variables \(X\) and \(Y\) can be jointly distributed, but whose separate (marginal) distributions match the histograms above. To give a very simple example, let \(X\) and \(Y\) be discrete random variables, each of which can only take values \(0\) or \(1\). Consider the following two possible joint distribution tables for the random pair \((X,Y)\):

0 1
0 0.25 0.25
1 0.25 0.25
0 1
0 0.5 0.0
1 0.0 0.5

In both cases, the marginals are the same, i.e., both \(X\) and \(Y\) are equally likely to take the value \(0\) or \(1\), i.e., they both have the Bernoulli distribution with parameter \(p=1/2\). That would correspond to the separate histograms to be the same. On the other hand, their joint distributions (aka dependence structures) are completely different. In the first (left) case, \(X\) and \(Y\) are independent, but in the second they are completely dependent.

They are probably not correlated since the sample correlation between x and y is close to \(0\):

(cor(x, y))
## [1] -0.02880239

but they do not look independent.

To apply the permutation test, we first plot the scatterplot of x vs. y as above. Then, we replace y by a vector with the same components, but randomly permute their positions, and then plot a scatterplot again. We repeat this three times:

y_perm_1 = sample(y)
y_perm_2 = sample(y)
y_perm_3 = sample(y)
plot(x, y)
plot(x, y_perm_1)
plot(x, y_perm_2)
plot(x, y_perm_3)

The conclusion is clear, the first (upper-left) plot is very different than the other three. Therefore, x and y are probably not independent.

Comments (Math): The point of this problem is to review the notion of the joint distribution between two random variables. The most important point here is that there is more to the joint distribution of two random vectors, than just the two distributions taken separately. In a sense, the whole is (much) more than the sum of its parts. This is something that does not happen in the deterministic world. If you give me the \(x\)-coordinate of a point, and, separately, its \(y\)-coordinate, I will be able to pinpoint the exact location of that point.

On the other hand, suppose that the \(x\)-coordinate of a point is unknown, so we treat it as a random variable, and suppose that this variable admits the standard normal distribution. Do the same for \(y\). Even with this information, you cannot say anything about the position of the point \((x,y)\). It could be that the reason we are uncertain about \(x\) and the reason we are uncertain about \(y\) have nothing to do with each other; in that case we would be right to assume that \(x\) and \(y\) are independent. If, on the other hand, we got the values of both \(x\) and \(y\) by measuring them using the same, inaccurate, tape measure, we cannot assume that the errors are independent. It is more likely that both \(x\) and \(y\) are too big, or both \(x\) and \(y\) are too small.

Mathematically, we say that random variables \(X\) and \(Y\) are independent if \[{\mathbb{P}}[X \in [a,b]] \times {\mathbb{P}}[ Y \in [c,d] ] = {\mathbb{P}}[ X\in [a,b] \text{ and } Y\in [c,d]]\text{ for all } a,b,c,d.\] While up to the point, this definition is not very eye-opening, or directly applicable in most cases. Intuitively, \(X\) and \(Y\) are independent if the distribution of \(Y\) would not change if we received additional information about \(X\). In our problem, random variables \(X\) and \(Y\) correspond to vectors x and y. Their scatterplot above clearly conveys the following message: when x is around \(-2\), we expect y to be around 4, while when x is around \(0\), y would be expected to be around \(0\), too.

Sometimes, it is not so easy to decide whether two variables are independent by staring at a scatterplot. What would you say about the scatterplot below? The permutation test is designed to help you decide when two (simulated) random variables are likely to be independent. The idea is simple. Suppose that x and y are simulations from two independent (not necessarily identical) distributions; say x=runif(1000) and y=rnorm(1000). The vector y_perm=sample(y) is a randomly permuted version of y (see R section below) and it contains exactly the same information about the distribution of y as y itself does. Both y and y_perm will produce exactly the same histogram. Permuting y, however, “uncouples” it from x. If there was any dependence between the values of x and y before, there certainly isn’t any now. In other the joint distribution of x and y_perm has the same marginals as the joint distribution of x and y, but all the (possible) dependence has been removed. What remains is to compare the scatterplot between x and y and the scatterplot between x and y_perm. If they look about the same, we conclude that x and y are independent. Otherwise, there is some dependence between them.

One question remains: why did we have to draw three scatterplots of permuted versions of y? That is because we have only finitely many data points, and it can happen, by pure chance, that the permutation we applied to y does not completely scramble its dependence on x. With a “sample” of three such plots, we get a better feeling for the inherent randomness in this permutation procedure, and it is much easier to tell whether “one of these things is not like the others”. Btw, the random variables in the scatterplot above are, indeed, independent; here are the \(4\) permutation-test plots to “prove” it:

Unlike univariate (one-variable) distributions which are visualized using histograms or similar plots, multivariate (several-variable) distributions are harder to depict. The most direct relative of the histogram is a 3d histogram. Just like the \(x\)-axis is divided into bins in the univariate case, in the multivariate case we divide the \(xy\)-plane into regions (squares, e.g.) and count the number of points falling into each of these regions. After that a 3d bar (a skyscraper) is drawn above each square with the height of each skyscraper equal (or proportional) to the number of points which fall into its base. Here is a 3d histogram of our original pair (x,y) from the problem. You should be able to rotate and zoom it right here in the notes, provided your browser has JavaScript enabled:

## Warning: `includeHTML()` was provided a `path` that appears to be a complete HTML document.
## ✖ Path: pics/3dhist.html
## ℹ Use `tags$iframe()` to include an HTML document. You can either ensure `path` is accessible in your app or document (see e.g. `shiny::addResourcePath()`) and pass the relative path to the `src` argument. Or you can read the contents of `path` and pass the contents to `srcdoc`.

A visualization solution that requires less technology would start the same way, i.e., by dividing the \(xy\) plane into regions, but instead of the third dimension, it would use different colors to represent the counts. Here is an example where the regions are hexagons, as opposed to squares; it just looks better, for some reason:

Just to showcase the range of possibilities, here is another visualization technique which which requires deeper statistical tools, namely the density contour plot:

Comments (R): There is very little new R here. You should remember that if x and y are vectors of the same length, plot(x,y) gives you a scatterplot of x and y.

To compute the sample correlation between two vectors, use the cor.

We used the command sample(y) to obtain a randomly permuted version of y. The simplicity of this is due to default parameters of the command sample which we already learned about. In particular, the default number of samples is exactly the size of the input vector y and, by default, sampling is performed without replacement. If you think about it for a second, you will realize that a sample of size \(n\) from the vector of size \(n\) without replacement is nothing by a random permutation of y.

You are not required to do this in your submissions, but if you want to display several plots side-by-side, use the command par(mfrow=c(m,n)) before the command plot. It tells R to plot the next \(mn\) plots in a \(m\times n\) grid.

Let the random variables \(X\) and \(Y\) have the joint distribution given by the following table:

1 2 3
1 0.1 0.2 0.3
2 0.2 0.2 0.0

Simulate \(10,000\) draws from the distribution of \((X,Y)\) and display a contingency table of your results.

 

joint_distribution_long = data.frame(
   x = c(1, 1, 1, 2, 2, 2),
   y = c(1, 2, 3, 1, 2, 3)
)
probabilities_long = 
       c(0.1, 0.2, 0.3, 0.2, 0.2, 0.0)

sampled_rows = sample(
   x = 1:nrow(joint_distribution_long),
   size = 10000,
   replace = TRUE,
   prob = probabilities_long
)

draws = joint_distribution_long[sampled_rows, ]

table(draws)
##    y
## x      1    2    3
##   1  962 2027 3047
##   2 1945 2019    0

Comments (Math): The main mathematical idea is to think of each pair of possible values of \(X\) and \(Y\) as a separate “object”, put all these objects in a “bag”, then then draw from the bag. In other words, we convert the bivariate distribution from the problem to the following univariate distribution

(1,1) (1,2) (1,3) (2,1) (2,2) (2,3)
0.1 0.2 0.3 0.2 0.2 0

and sample from it. When you do, you will get a vector whose elements are pairs of numbers. The last step is to extract the components of those pairs into separate vectors.

Comments (R): The most important new R concept here is data.frame. You should think of it as a spreadsheet. It is, mathematically, a matrix, but we do not perform any mathematical operations on it. Moreover, not all columns in the data frame have to be numeric. Some of them can be strings, and other can be something even more exotic. You should think of a data frame as a bunch of column vectors of the same length stacked side by side. It is important to note that each column of a data frame will have a name, so that we don’t have to access it by its position only (as we would have to in the case of a matrix).

In this class, the column vectors of data frames are going to contain simulated values. In statistics, it is data that comes in data frames, with rows corresponding to different observations, and columns to various observed variables.

The easiest way to construct a data frame using already existing vectors is as follows:

x = c(1, 2, 3)
y = c("a", "b", "c")
(df = data.frame(x, y))
##   x y
## 1 1 a
## 2 2 b
## 3 3 c

Note that the two columns inherited their names from the vectors x and y that fed into them. Note, also, that all rows got consecutive numerical values as names by default. Row names are sometimes useful to have, but are in general a nuisance and should be ignored (especially in this class). Column names are more important, and there is a special notation (the dollar-sign notation) that allows you to access a column by its name:

df$y
## [1] "a" "b" "c"

If you want to give your columns custom names (or if you are building them out of explicitly given vectors as in the solution above) use the following syntax

z = c("a", "b", "c", "d")
(df = data.frame(letters = z, numbers = c(1, 2, 3, 4)))
##   letters numbers
## 1       a       1
## 2       b       2
## 3       c       3
## 4       d       4

A feature that data frames share with vectors and matrices is that you can use vector indexing as in the following example (where df is as above)

df[c(2, 4, 4, 1), ]
##     letters numbers
## 2         b       2
## 4         d       4
## 4.1       d       4
## 1         a       1

Make sure you understand why the expression inside the brackets is c(2,4,4,1), and not c(2,4,4,1). R’s desire to keep row names unique leads to some cumbersome constructs such as 4.1 above. As I mentioned before, just disregard them.

A nice thing about data frames is that they can easily be pretty-printed in RStudio. Go to the Environment tab in one of your RStudio panes, and double click on the name of the data frame you just built. It will appear as a nicely formatted spreadsheet.

Once we have the data frame containing all \(6\) pairs of possible values \(X\) and \(Y\) can take (called joint_distribution_long in the solution above), we can proceed by sampling from its rows, by sampling from the set 1,2,3,4,5,6 with probabilities 0.1, 0.2, 0.3, 0.2, 0.2, 0.0. The result of the corresponding sample command will be a sequence - called sampled_rows in the solution - of length \(10,000\) composed of numbers \(1,2,3,4,5\) or \(6\). The reason we chose the name sampled_rows is because each number corresponds to a row from the data frame joint_distribution_long, and by indexing joint_distribution_long by sampled_rows we are effectively sampling from its rows. In other words, the command joint_distribution_long[sampled_rows, ] turns a bunch of numbers into a bunch of rows (many of them repeated) of the data frame joint_distribution_long.

The final step is to use the function table. This time, we are applying it to a data frame and not to a vector, but the effect is the same. It tabulates all possible combinations of values of the columns, and counts how many times each of them happened. The same result would have been obtained by calling table(draws$x, draws$y).

Monte Carlo

Use Monte Carlo to estimate the expected value of the exponential random variable with parameter \(\lambda= 4\) using \(n=10\), \(n=1,000\) and \(n=1,000,000\) simulations. Compare to the exact value.

 

x = rexp(10, rate = 4)
mean(x)
## [1] 0.1779768

For an exponential random variable with parameter \(\lambda\), the expected value is \(1/\lambda\) (such information can be found in Appendix A) which, in this case, is \(0.25\). The error made was 0.072023 for \(n=10\) simulations.

We increase the number of simulations to \(n=1000\) and get a better result

x = rexp(1000, rate = 4)
mean(x)
## [1] 0.2564643

with (smaller) error -0.0064643. Finally, let’s try \(n=1,000,000\):

x = rexp(1000000, rate = 4)
mean(x)
## [1] 0.250381

The error is even smaller -0.00038101.

Comments (R): The only new thing here is the command mean which computes the mean of a vector.

Comments (Math): There is a lot going on here conceptually. This is the first time we used the Monte Carlo method. It is an incredibly useful tool, as you will keep being reminded throughout this class. The idea behind it is simple, and it is based on the Law of large numbers:

Theorem Let \(X_1,X_2, \dots\) be an independent sequence of random variables with the same distribution, for which the expected value can be computed. Then \[ \tfrac{1}{n} \Big( X_1+X_2+\dots+X_n\Big) \to {\mathbb{E}}[X_1] \text{ as } n\to\infty\] The idea behind Monte Carlo is to turn this theorem “upside down”. The goal is to compute \({\mathbb{E}}[X_1]\) and use a supply of random numbers, each of which comes from the same distribution, to accomplish that. The random number generator inside rexp gives us a supply of numbers (stored in the vector x) and all we have to do is compute their average. This gives us the left-hand side of the formula above, and, if \(n\) is large enough, we hope that this average does not differ too much from its theoretical limit. As \(n\) gets larger, we expect better and better results. That is why your error above gets smaller as \(n\) increases.

It looks like Monte Carlo can only be used to compute the expected value of a random variable, which does not seem like such a bit deal. But it is! You will see in the sequel that almost anything can be written as the expected value of some random variable.

Use Monte Carlo to estimate \({\mathbb{E}}[X^2]\), where \(X\) is a standard normal random variable.

You may or may now know that when \(X\) is standard normal \(Y=X^2\) has a \(\chi^2\) distribution with one degree of freedom. If you do, you can solve the problem like this:

y = rchisq(5000, df = 1)
mean(y)
## [1] 0.9771929

If you don’t, you can do the following:

x = rnorm(5000)
y = x^2
mean(y)
## [1] 1.019852

Comments (Math+R): We are asked to compute \({\mathbb{E}}[ X^2]\), which can be interpreted in two ways. First, we can think of \(Y=X^2\) as a random variable in its own right and you can try to take draws from the distribution of \(Y\). In the case of the normal distribution, the distribution of \(Y\) is known - it happens to be a \(\chi^2\)-distribution with a single degree of freedom (don’t worry if you never heard of it). We can simulate it in R by using its R name chisq and get a number close to the exact value of \(1\).

If you did not know about the \(\chi^2\) distribution, you would not know what R name to put the prefix r in front of. What makes the simulation possible is the fact that \(Y\) is a transformation of a random variable we know how to simulate. In that case, we simply simulate the required number of draws x from the normal distribution (using rnorm) and then apply the transformation \(x \mapsto x^2\) to the result. The transformed vector y is then nothing but the sequence of draws from the distribution of \(X^2\).

The idea described above is one of main advantages of the Monte Carlo technique: if you know how to simulate a random variable, you also know how to simulate any (deterministic) function of it. That fact will come into its own a bit later when we start working with several random variables and stochastic processes, but it can be very helpful even in the case of a single random variable, as you will see in the next problem.

Let \(X\) be a standard normal random variable. Use Monte Carlo to estimate the probability \({\mathbb{P}}[ X > 1 ]\). Compare to the exact value.

The estimated probability:

x = rnorm(10000)
y = x > 1
(p_est = mean(y))
## [1] 0.1608

The exact probability and the error

p_true = 1 - pnorm(1)
(err = p_est - p_true)
## [1] 0.002144746

Comments (R): As we learned before, the symbol > is an operation, which returns a Boolean (TRUE or FALSE) value. For example:

1 > 2
## [1] FALSE
5^2 > 20
## [1] TRUE

It is vectorized:

x = c(1, 2, 4)
y = c(5, -4, 3)
x > y
## [1] FALSE  TRUE  TRUE

and recycling rules apply to it (so that you can compare a vector and a scalar, for example)

x = 1:10
x > 5
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

Therefore, the vector y in the solution is a vector of length \(10000\) whose elements are either TRUE or FALSE; here are the first 5 rows of data frame with columns x and y from our solution:

x y
1.9493 FALSE
-1.1015 TRUE
1.0448 TRUE
-0.1384 TRUE
-0.2573 TRUE

Finally, z contains the mean of y. How do you compute a mean of Boolean values? In R (and many other languages) TRUE and FALSE have default numerical values, usually \(1\) and \(0\). This way, when \(R\) is asked to compute the sum of a Boolean vector it will effectively count the number of values which are TRUE. Similarly, the mean is the relative proportion of TRUE values.

Comments (Math): We computed the proportion of the “times” \(X>1\) (among many simulations of \(X\)) and used it to approximate the probability \({\mathbb{P}}[ X>1]\). More formally, we started from a random variable \(X\) with a normal distribution and then transformed it into another random variable, \(Y\), by setting \(Y=1\) whenever \(X>1\) and \(0\) otherwise. This is often written as follows \[ Y = \begin{cases} 1, & X>1 \\ 0, & X\leq 1.\end{cases}\] The random variable \(Y\) is very special - it can only take values \(0\) and \(1\) (i.e., its support is \(\{0,1\}\)). Such random variables are called indicator random variables, and their distribution, called the Bernoulli distribution, always looks like this:

0 1
1-p p

for some \(p \in [0,1]\). The parameter \(p\) is nothing but the probability \({\mathbb{P}}[Y=1]\).

So why did we decide to transform \(X\) into \(Y\)? Because of the following simple fact: \[ {\mathbb{E}}[ Y] = 1 \times p + 0 \times (1-p) = p.\] The expected value of an indicator is the probability \(p\), and we know that we can use Monte Carlo whenever we can express the quantity we are computing as an expected value of a random variable we know how to simulate.

Many times, simulating a random variable is easier than analyzing it analytically. Here is a fun example:

Use Monte Carlo to estimate the value of \(\pi\) and compute the error.

 

nsim = 1000000
x =  runif(nsim,-1, 1)
y =  runif(nsim,-1, 1)
z = (x ^ 2 + y ^ 2) < 1
(pi_est = 4 * mean(z))
## [1] 3.141728
(err = pi_est - pi)
## [1] 0.0001353464

Comments (Math): As we learned in the previous problem, probabilities of events can be computed using Monte Carlo, as long as we know how to simulate the underlying indicator random variable. In this case, we want to compute \(\pi\), so we would need to find a “situation” in which the probability of something is \(\pi\). Of course, \(\pi>1\), so it cannot be a probability of anything, but \(\pi/4\) can, and computing \(\pi/4\) is as useful as computing \(\pi\). To create the required probabilistic “situation” we think of the geometric meaning of \(\pi\), and come up with the following scheme. Let \(X\) and \(Y\) be two independent uniform random variables each with values between \(-1\) and \(1\). We can think of the pair \((X,Y)\) as a random point in the square \([-1,1]\times [-1,1]\). This point will sometimes fall inside the unit circle, and sometimes it will not. What is the probability of hitting the circle? Well, since \((X,Y)\) is uniformly distributed everywhere inside the square, this probability should be equal to the portion of the area of our square which belongs to the unit circle. The area of the square is \(4\) and the area of the circle is \(\pi\), so the required probability is \(\pi/4\). Using the idea from the previous problem, we define the indicator random variable \(Z\) as follows \[ Z = \begin{cases} 1 & (X,Y) \text{ is inside the unit circle, } \\ 0 & \text{ otherwise.} \end{cases} = \begin{cases} 1& X^2+Y^2 < 1, \\ 0 & \text{ otherwise.} \end{cases}\]

  1. Write an R function cumavg which computes the sequence of running averages of a vector, i.e., if the input is \(x=(x_1,x_2,x_3,\dots, x_n)\), the output should be \[ \Big(x_1, \frac{1}{2} (x_1+x_2), \frac{1}{3}(x_1+x_2+x_3), \dots, \frac{1}{n} (x_1+x_2+\dots+x_n)\Big).\] Test it to check that it really works. (Hint: look up the function cumsum. )

  1. Apply cumavg to the vector \(4 z\) from the previous problem and plot your results (use a smaller value for nsim. Maybe \(1000\).) Plot the values against their index. Add a red horizontal line at the level \(\pi\). Rerun the same code (including the simulation part) several times.

 

  1. Once you know about the function cumsum, the problem becomes much easier.
cumavg = function(x) {
    c = cumsum(x)
    n = 1:length(x)
    return(c/n)
}
x = c(1, 3, 5, 3, 3, 9)
cumavg(x)
## [1] 1 2 3 3 3 4
nsim = 1000
x =  runif(nsim,-1, 1)
y =  runif(nsim,-1, 1)
z = (x ^ 2 + y ^ 2) < 1
pi_est = cumavg(4 * z)
plot(
   1:nsim,
   pi_est,
   type = "l",
   xlab = "number of simulations",
   ylab = "estimate of pi",
   main = "Computing pi by Monte Carlo"
)
abline(pi, 0,
       col = "red")

Comments (R): This course is not about R graphics, but I think it is a good idea to teach you how to make basic plots in R. We already used the function plot to draw scatterplots. By default, each point drawn by plot is marked by a small circle so it might not seem like a good idea to use it. Luckily this, and many other things, can be adjusted by numerous additional arguments. One of such arguments is type which determines the type of the plot. We used type="l" which tells R to join the points with straight lines:

x = c(1, 3, 4, 7)
y = c(2, 1, 5, 5)
plot(x, y, type = "l")

The other arguments, xlab, ylab and main determine labels for axes and the entire plot. The function abline(a,b) adds a line \(y = a x + b\) to an already existing plot. It is very useful in statistics if one wants to show the regression line superimposed on the scatterplot of data. Finally, the argument col, of course, determines the color of the line. To learn about various graphical parameters, type ?par.

Comments (Math): The conceptual reason for this exercise is to explore (numerically) the kinds of errors we make when we use Monte Carlo. Unlike the deterministic numerical procedures, Monte Carlo has a strange property that no bound on the error can be made with absolute certainty. Let me give you an example. Suppose that you have a biased coin, with the probability \(0.6\) of heads and \(0.4\) of tails. You don’t know this probability, and use a Monte Carlo technique to estimate it - you toss your coin \(1000\) times and record the number of times you observe \(H\). The law of large numbers suggests that the relative frequency of heads is close to the true probability of \(H\). Indeed, you run a simulation

x = sample(c("T", "H"), 1000, prob = c(0.4, 0.6), replace = TRUE)
y = x == "H"
mean(y)
## [1] 0.594

and get a pretty accurate estimate of \(0.594\). If you run the same code a few more times, you will get different estimates, but all of them will be close to \(0.6\). Theoretically, however, your simulation could have yielded \(1000\) Hs, which would lead you to report \(p=1\) as the Monte-Carlo estimate. The point is that even though such disasters are theoretically possible, they are exceedingly unlikely. The probability of getting all \(H\) in \(1000\) tosses of this coin is a number with more than \(500\) zeros after the decimal point.

The take-home message is that even though there are no guarantees, Monte Carlo performs well the vast majority of the time. The crucial ingredient, however, is the number of simulations. The plot you were asked to make illustrates exactly that. The function cumavg gives you a whole sequence of Monte-Carlo estimates of the same thing (the number \(\pi\)) with different numbers of simulations nsim. For small values of nsim the error is typically very large (and very random). As the number of simulations grows, the situations stabilizes and the error decreases. Without going into the theory behind it, let me only mention is that in the majority of practical applications we have the following relationship: \[ error \sim \frac{1}{\sqrt{n}}.\] In words, if you want to double the precision, you need to quadruple the number of simulations. If you want an extra digit in your estimate, you need to multiply the number of simulations by \(100\). Here is an image where I superimposed \(40\) plots like the one you were asked to produce (the dashed lines are \(\pm \frac{4}{\sqrt{n}}\)):

Conditional distributions

Let \(X\) and \(Y\) be two independent geometric random variables with parameters \(p=0.5,\) and let \(Z=X+Y\). Compute \({\mathbb{P}}[ X = 3| Z = 5]\) using simulation. Compare to the exact value.

By simulation:

nsim = 1000000
X = rgeom(nsim, prob = 0.5)
Y = rgeom(nsim, prob = 0.5)
Z = X + Y
X_cond = X[Z == 5]
mean(X_cond  == 3) 
## [1] 0.1684758

To get the exact value, we start from the definition: \[ {\mathbb{P}}[ X = 3 | Z= 5 ] = \frac{{\mathbb{P}}[ X=3 \text{ and }Z=5]}{{\mathbb{P}}[Z=5]} = \frac{{\mathbb{P}}[X=3 \text{ and }Y = 2]}{{\mathbb{P}}[Z=5]}, \] where the last equality follows from the fact that \(\{ X=3 \text{ and } Z=5 \}\) is exactly the same event as \(\{ X = 3 \text{ and } Y=2\}\). Since \(X\) and \(Y\) are independent, we have \[{\mathbb{P}}[ X=3 \text{ and }Y=2 ] = {\mathbb{P}}[X=3] \times {\mathbb{P}}[ Y=2] = 2^{-4} 2^{-3} = 2^{-7}.\] To compute \({\mathbb{P}}[ Z = 5]\) we need to split the event \(\{ Z = 5 \}\) into events we know how to deal with. Since \(Z\) is built from \(X\) and \(Y\), we write \[ \begin{align} {\mathbb{P}}[ Z = 5 ] = &{\mathbb{P}}[X=0 \text{ and }Y=5]+ {\mathbb{P}}[ X=1 \text{ and }Y=4] + {\mathbb{P}}[ X=2 \text{ and }Y=3] + \\ & {\mathbb{P}}[ X=3 \text{ and }Y=2] + {\mathbb{P}}[ X=4 \text{ and }Y=1] + {\mathbb{P}}[ X = 5 \text{ and }Y=0]. \end{align}\] Each of the individual probabilities in the sum above is \(2^{-7}\), so \({\mathbb{P}}[ X = 3 | Z = 5] = \frac{1}{6}\). This gives us an error of 0.0018091.

Comments (Math): Let us, first, recall what the conditional probability is. The definition we learn in the probability class is the following \[ {\mathbb{P}}[A | B] = \frac{{\mathbb{P}}[A \text{ and }B]}{{\mathbb{P}}[B]},\] as long as \({\mathbb{P}}[B]>0\). The interpretation is that \({\mathbb{P}}[A|B]\) is still the probability of \(A\), but now in the world where \(B\) is guaranteed to happen. Conditioning usually happens when we receive new information. If someone tells us that \(B\) happened, we can disregard everything in the complement of \(B\) and adjust our probability to account for that fact. First we remove from \(A\) anything that belongs to the complement of \(B\), and recompute the probability \({\mathbb{P}}[A \cap B]\). We also have to divide by \({\mathbb{P}}[B]\) because we want the total probability to be equal to \(1\).

Our code starts as usual, but simulating \(X\) and \(Y\) from the required distribution, and constructing a new vector \(Z\) as their sum. The variable X_cond is new; we build it from \(X\) by removing all the elements whose corresponding \(Z\) is not equal to \(5\). This is an example of what is sometimes called the rejection method in simulation. We simply “reject” all simulations which do not satisfy the condition we are conditioning on. We can think of X_cond as bunch of simulations of \(X\), but in the world where \(Z=5\) is guaranteed to happen. Once we have X_cond, we proceed as usual by computing the relative frequency of the value \(3\) among all possible values \(X\) can take. Note that the same X_cond can also be used to compute the conditional probability \({\mathbb{P}}[ X=1| Z=5]\). In fact, X_cond contains the information about the entire conditional distribution of \(X\) given \(Z=5\); if we draw a histogram of X_cond, we will get a good idea of what this distribution looks like:

Since X_cond contains only discrete values from \(0\) to \(5\), a contingency table might be a better tool for understanding its distribution:

0 1 2 3 4 5
7745 7761 7691 7807 7731 7604

The histogram and the table above suggest that the distribution of \(X\), given \(Z=5\), is uniform on \(\{0,1,2,3,4,5\}\). It is - a calculation almost identical to the one we performed above gives that \({\mathbb{P}}[ X= i| Z=5] = \frac{1}{6}\) for each \(i=0,1,2,3,4,5\).

One more observation at the end. Note that we drew \(n=1,000,000\) simulations this time. While it is probably an overkill for this particular example, conditional probabilities in general require more simulations than unconditional ones. Of course, that is because we reject most of our original draws. Indeed, the size of the vector X_cond is 46339 - more than a \(20\)-fold decrease. This fact becomes particularly apparent when we try to use Monte Carlo for conditional distributions associated with continuous random vectors as we will see in out next problem.

Let \(X\) and \(Y\) be independent random variables where \(X\) has the \(N(0,1)\) distribution and \(Y\) the exponential distribution with parameter \(\lambda=1\). Find a graphical approximation to the conditional density of \(Y\), given \(X+Y\geq 1\). Repeat the same, but condition on \(X+Y=1\).

 

nsim = 100000
x = rnorm(nsim)
y = rexp(nsim)

cond = x + y >= 1
x_cond = x[cond]
hist(x_cond, breaks = 100)

nsim = 100000
eps = 0.1
x = rnorm(nsim)
y = rexp(nsim)
cond = (1 - eps < x + y) & (x + y < 1 + eps)
x_cond = x[cond]
hist(x_cond, breaks = 100)

Comments (Math): In the case of conditioning on \(X+Y\geq 1\) we repeated the same procedure as in the discrete case. We simply rejected all draws that do not satisfy the condition.

When Conditioning on \(X+Y=1\), however, you immediately encounter a problem that you don’t get with discrete distributions. The event \(\{ X+Y=1\}\) has probability \(0\) and will never happen. That means that our strategy form the previous problem will simply not work - you will reject all draws. The problem goes beyond a particular approach to the problem, as the conditional probabilities such as \({\mathbb{P}}[ Y \geq 0 | X+Y=1]\) are not well defined. Indeed, the formula \[ {\mathbb{P}}[ Y \geq 0 | X+Y=1] "=" \frac{{\mathbb{P}}[ Y\geq 0 \text{ and } X+Y=1]}{ {\mathbb{P}}[X+Y=1]}\] requires that the probability in the denominator be strictly positive. Otherwise you are dividing by zero. The theoretical solution to this is by no means simple and requires mathematics beyond the scope of these notes. Practically, there is a very simple way of going around it. Instead of conditioning on the zero-probability event \(X+Y=1\), we use a slightly more relaxed condition \[ X+Y \in (1-\varepsilon, 1+\varepsilon) \] for a small, but positive, \(\varepsilon\). In many cases of interest, this approximation works very well, as long as \(\varepsilon\) is not too big. How big? Well, that will depend on the particular problem, as well as on the number of simulations you are drawing. The best way is to try several values and experiment. For example, if we chose \(\varepsilon=0.01\) in our problem, the number of elements in x_cond (i.e., the number of non-rejected draws) would be on the order of \(100\), which may be considered to small to produce an accurate histogram. On the other hand, when \(\varepsilon=1\), your result will be inaccurate because you are conditioning on the event \(0 < X+Y < 2\) which is a poor approximation for \(X+Y=1\). The rule of thumb is to take the smallest \(\varepsilon\) you can, while keeping the number of non-rejected draws sufficiently large.

Additional Problems for Chapter 2

 

  1. Find the Weibull distribution in R’s help system. Simulate \(n=10000\) draws from the Weibull distribution with shape parameter \(2\) and scale parameter \(3\). Draw a histogram of your simulations.

  2. Suppose that the vector x contains \(n=10000\) simulations from the standard normal \(\mu=0, \sigma=1)\). Without simulating any new random numbers, transform it into the vector y such that y is a vector of \(n=10000\) simulations from the normal with \(\mu=1\) and \(\sigma=0.5\). Draw histograms of both x and y on the same plot. (Note: the extra parameter add is used to superimpose plots. You may want to use different colors, too. Use the parameter col for that. )

  3. Starting with x=seq(-3,3,by=0.1), define the appropriate vector y and use x and y to plot the graph of the cdf of the standard normal. The command you want to use is plot with the following extra arguments

    • type="l" (to get a smooth line instead of a bunch of points).
    • main="The CDF of the standard normal" (to set the title), and
    • another argument (which you must look up yourself) that will set the \(y\)-axis label to \(F(x)\).


  1. Simulate \(n=1000\) draws from the distribution whose distribution table is given by

    2

    4

    8

    16

    0.2

    0.3

    0.1

    0.4

    Draw a histogram of your results.

  2. You may have learned in probability how to compute the pdf \(f_Y(y)\) of a transformation \(Y=g(X)\) of a random variable with pdf \(f_X(x)\). Suppose that you forgot how to do that, but have access to \(10,000\) simulations from the distribution of \(X\). How would you get an approximate idea about the shape of the function \(f_Y\)?

    More concretely, take \(X\) to be exponentially distributed with parameter \(1\) and \(g(x) = \sin(x)\) and produce a picture that approximates the pdf \(f_Y\) of \(Y\). (Note: even if you remember how to do this analytically, you will run into a difficulty. The function \(\sin(x)\) is not one-to-one and the method usually taught in probability classes will not apply. If you learned how to do it in the many-to-one case of \(g(x)= \sin(x)\), kudos to your instructor!)

  3. Let \(X\) be a random variable with the Cauchy distribution, and \(Y = \operatorname{arctan}(X)\). R allows you to simulate from the Cauchy distribution, even if you do not know what it is. How would you use that to make an educated guess as to what the distribution of \(Y\) is? To make your life easier, consider \(\tfrac{2}{\pi} Y\) first.

Click for Solution


  1. x = sample(c(2, 4, 8, 16), size = 10000, prob = c(0.2, 0.3, 0.1, 0.4), replace = TRUE)
    hist(x)

    Note: given that we are dealing with a discrete distribution, a contingency table might be a better choice:

    2

    4

    8

    16

    2014

    3052

    951

    3983

  2. We apply the function \(\sin\) to the simulations. The histogram of the obtained values is going to be a good (graphical) approximation to the pdf of the transformed random variable:

    x = rexp(100000)
    y = sin(x)
    hist(y)

  3. Having learned that histograms look like the pdfs of the underlying distributions, we draw the histogram:

    x = rcauchy(10000)
    y = atan(x) * 2/pi
    hist(y)

    It looks uniform (if we replace \(10,000\) by \(100,000\)&_t + it will look even more uniform). We conclude that \(2/\pi \arctan(X)\) is probably uniformly distributed on \((-1,1)\). Hence, \(Y = \arctan(X)\) is probably uniformly distributed on \((-\pi/2, \pi/2)\).

A basic method for obtaining simulations draws from distributions other than the uniform is the transformation method. The idea is to start with (pseudo) random numbers, i.e., draws from the uniform \(U(0,1)\) distribution, and then apply a function \(g\) to each simulation. The difficulty is, of course, how to choose the right function \(g\).
Let \(X\) be a random variable with a continuous and strictly increasing cdf \(F\). What is the distribution of \(Y=F (X)\)? What does that have to do with the transformation method?

(Hint: if you are having difficulty with this problem, feel free to run some experiments using R. )

Click for Solution

Let us perform an experiment where \(X \sim N(0,1)\). Remembering that the cdf is given by the R function pnorm:

x = rnorm(100000)
y = pnorm(x)
hist(y)

This looks like a histogram of a uniform distribution on \((0,1)\). Let’s try with some other continuous distributions

x1 = rexp(100000)
x2 = rcauchy(100000)
x3 = runif(100000)
x4 = rgamma(100000, shape = 3)
par(mfrow = c(2, 2))
hist(pexp(x1))
hist(pcauchy(x2))
hist(punif(x3))
hist(pgamma(x4, shape = 3))

All of those point to the same conjecture, namely that \(F(X)\) is uniformly distributed on \((0,1)\). To prove that, we take \(Y=F(X)\) and try to compute that cdf \(F_Y\) of \(Y\): \[F_Y(y) = {\mathbb{P}}[ Y \leq y] = {\mathbb{P}}[ F(X) \leq y]\] Since \(F\) is strictly increasing, it admits an inverse \(F^{-1}\). Moreover, for any \(y \in (0,1)\), the set of all values of \(x\) such that \(F(x)\leq y\) (the red range) is exactly the interval \((-\infty, F^{-1}(y)]\) (the blue range), as in the picture below:

Hence, \[F_Y(y)={\mathbb{P}}[Y\leq y] = {\mathbb{P}}[ F(X) \leq y] = {\mathbb{P}}[ X \leq F^{-1}(y) ] = F(F^{-1}(y)) = y, \text{ for } y\in (0,1).\] The cdf \(F_Y\) is, therefore, equal to the cdf of a uniform on \((0,1)\). Since the cdf uniquely determines the distribution, \(Y\) must be uniformly distributed on \((0,1)\).


  1. Let \(f_1\) and \(f_2\) be two pdfs. We take a constant \(\alpha \in (0,1)\) and define the function \(f\) by \[ f(x) = \alpha f_1(x) + (1-\alpha) f_2(x).\] The function \(f\) is the pdf of a third distribution, which is called the mixture of \(f_1\) and \(f_2\) with weights \(\alpha\) and \(1-\alpha\). Assuming that you know how to simulate from the distributions with pdfs \(f_1\) and \(f_2\), how would you draw \(10,000\) simulations from the mixture \(f\)? Show your method on the example of a mixture of \(N(0,1)\) and \(N(4,1)\) with \(\alpha=2/3\). Plot the histogram of the obtained sample (play with the parameter breaks until you get a nice picture.)

    (Hint: start with two vectors, the first containing \(10,000\) simulations from \(f_1\) and the second from \(f_2\). Then “toss” \(10,000\) biased coins with \(\mathbb{P}[ H ] = \alpha\) … )

  2. The double exponential or Laplace distribution is a continuous probability distribution whose pdf is given by \[ \tfrac{1}{2} \exp(-|x|), x\in {\mathbb R}.\] This distribution is not built into R. How would you produce simulations from the double exponential using R?

Click for Solution


  1. The idea is that before each draw a biased coin (with \({\mathbb{P}}[H]=\alpha\)) is tossed. If \(H\) is obtained, we draw from the distribution with pdf \(f_1\). Otherwise, we draw from the distribution with pdf \(f_2\). We write a function which performs one such simulation, and then use the command replicate to call it several times and store the results in the vector:

    single_draw = function() {
        coin = sample(c(1, 2), prob = c(2/3, 1/3), size = 1, replace = TRUE)
        if (coin == 1)
            return(rnorm(1)) else return(rnorm(1, mean = 4, sd = 1))
    }
    
    nsim = 20000
    y = replicate(nsim, single_draw())
    
    hist(y)

    As you can see, the histogram has two “humps”, one centered around \(0\) and the other centered around \(4\). The first one is taller, which reflects the higher weight (\(\alpha=2/3\)) that \(N(0,1)\) has in this mixture.

    If you wanted to write a more succinct vectorized code (which is not necessarily faster in this case), you could also do something like this

    nsim = 10000
    alpha = 2/3
    x1 = rnorm(nsim)
    x2 = rnorm(nsim, mean = 4, sd = 1)
    coin = sample(c(TRUE, FALSE), size = nsim, prob = c(alpha, 1 - alpha), replace = TRUE)
    y = ifelse(coin, x1, x2)

    The function ifelse is a vectorized version of the if-then blok and takes three arguments of equal length. The first one is a vector of logical values c, and the other two, x1, x2 only need to be of the same type. The result of is a vector whose value at the position i is x1[i] if c[i]==TRUE and x2[i] otherwise.

  2. The Laplace distribution can be understood as a mixture, with \(\alpha=1/2\), of two distributions. The first one is an exponential, and the second one is obtained from by putting the minus sign in front of it.
    Using our strategy from part 1. above, we could get simulations of it as follows:

    nsim = 100000
    alpha = 1/2
    x1 = rexp(nsim)
    x2 = -rexp(nsim)  # note the minus sign in front of rexp
    coin = sample(c(TRUE, FALSE), size = nsim, prob = c(alpha, 1 - alpha), replace = TRUE)
    y = ifelse(coin, x1, x2)
    hist(y)

    You can do this more efficiently if you realize that every time we toss a coin and choose between x1 and x2, we are really choosing the sign in front of an exponentially distributed random variable. In other words, we can use coin as a vector of random signs for a vector or draws from the exponential distribution:

    nsim = 10000
    alpha = 1/2
    x = rexp(nsim)
    coin = sample(c(-1, 1), size = nsim, prob = c(alpha, 1 - alpha), replace = TRUE)
    y = coin * x
    hist(y)

Let x=rnorm(1000) and y=rnorm(1000). For each of the following pairs, use the permutation test to decide whether they are independent or not

  1. x^2+y^2 and y^2
  2. (x+y)/sqrt(2) and (x-y)/sqrt(2)
  3. x and 1
  4. x^2+y^2 and atan(y/x).

(Note: do not worry about dividing by \(0\) in d. It will happen with probability \(0\).)

Click for Solution

Let us start by writing a function to save some keystrokes

permutation_test = function(z, w) {
    par(mfrow = c(2, 2))
    plot(z, w, asp = 1)
    plot(z, sample(w), asp = 1)
    plot(z, sample(w), asp = 1)
    plot(z, sample(w), asp = 1)
}

x = rnorm(1000)
y = rnorm(1000)
  1. permutation_test(x^2 + y^2, y^2)

    The first plot is very different from the other three. Therefore,the vectors are probably not independent.

  2. permutation_test((x + y)/sqrt(2), (x - y)/sqrt(2))

    The first plot could easily be confused for one of the other three. Therefore the vectors are probably independent.

  3. # we have to use rep(1,length(x)) to get a vector of 1s of the same length as
    # x.  R will not recycle it properly if you simply write 1.  Another, more
    # 'hacky' way would be to take advantage of recycling and use 0*x+1
    permutation_test(x, rep(1, length(x)))

    The plots look very similar. Therefore, the vectors are probably independent. We could have known this without drawing any graphs. Anything is independent of a constant random variable (vector).

  4. permutation_test(x^2 + y^2, atan(y/x))

    Plots look very similar to each other. Therefore, z and w are probably independent.

Note: The plots in b) and d) reveal that the distribution of the random vector \((X,Y)\) consisting of two independent standard normals is probably rotation invariant. In b) we are asked to compare the coordinates of the vector obtained from \((X,Y)\) by a rotation at \(45\) degrees around the origin. The fact that independence persisted suggests that components remain independent even after a (specific) rotation. If you tried rotations by different angles you would get the same result. The experiment in d) told us that the (squared) distance \(X^2+Y^2\) and angle between \((X,Y)\) and the \(x\)-are independent. This is also something that one would expect from a rotationally-invariant distribution. Indeed, the distribution of the distance to the origin should not depend on the direction.

It is important to note that none of this proves anything. It is simply numerical evidence for a given conclusion.

Simulate \(n=10000\) draws from the joint distribution given by the following table:

1 2 3
1 0.1 0.0 0.3
2 0.1 0.1 0.0
3 0.0 0.0 0.4

Display the contingency table of your results, as well as a table showing the “errors”, i.e., differences between the theoretical frequencies (i.e., probabilities given above) and the obtained relative frequencies in the sample.

Click for Solution

We are using the procedure from Section 2.2 in the notes.

nsim = 10000

joint_distribution_long = data.frame(
   x = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
   y = c(1, 2, 3, 1, 2, 3, 1, 2, 3)
   )
probabilities_long =
   c(0.1, 0.0, 0.3,
     0.1, 0.1, 0.0,
     0.0, 0.0, 0.4)

sampled_rows = sample(
   x = 1:nrow(joint_distribution_long),
   size = nsim,
   replace = TRUE,
   prob = probabilities_long
)

draws = joint_distribution_long[sampled_rows,]

(rel_freq = prop.table(table(draws)))
##    y
## x        1      2      3
##   1 0.0943 0.0000 0.3022
##   2 0.1011 0.1023 0.0000
##   3 0.0000 0.0000 0.4001
(th_freq = matrix(probabilities_long, byrow = TRUE, nrow = 3))
##      [,1] [,2] [,3]
## [1,]  0.1  0.0  0.3
## [2,]  0.1  0.1  0.0
## [3,]  0.0  0.0  0.4
(err = rel_freq - th_freq)
##    y
## x         1       2       3
##   1 -0.0057  0.0000  0.0022
##   2  0.0011  0.0023  0.0000
##   3  0.0000  0.0000  0.0001

Estimate the following integrals using Monte Carlo

  1. \(\int_0^1 \cos(x)\, dx\)

  2. \(\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\frac{e^{-x^2/2}}{1+x^4}\,dx\)

  3. \(\int_0^{\infty} e^{-x^3-x}\, dx\)

  4. \(\int_{-\infty}^{\infty} \frac{\cos(x^2)}{1+x^2}\, dx\) (extra credit)

Click for Solution

The idea here is to use the “fundamental theorem of statistics” \[ {\mathbb{E}}[ g(X) ] = \int g(x)\, f_X(x)\, dx \] where \(f_X\) is the pdf of \(X\) and \(g\) is any reasonably well-behaved function. Normally, one would use the integral on the right to compute the expectation on the left. We are flipping the logic, and using the expectation (which we can approximate via Monte Carlo) to estimate the integral on the right.

  1. We pick \(g(x) = \cos(x)\) and \(X\) a r.v. with a uniform distribution on \((0,1)\), so that \(f_X(x) = 1\) for \(x\in (0,1)\) and \(0\) otherwise:

    nsim = 10000
    x = runif(nsim)
    y = cos(x)
    mean(y)
    ## [1] 0.8414215

    For comparison, the exact value of the integral is \(\sin(1) \approx 0.841471\).

  2. We cannot use the uniform distribution anymore, because the limits of integration are \(\pm \infty\). Part of the expression inside the integral can be recognized as a (standard) normal density, so we take \(X \sim N(0,1)\) and \(g(x) = 1/(1+x^4)\)

    nsim = 10000
    x = rnorm(nsim)
    y = 1/(1 + x^4)
    mean(y)
    ## [1] 0.6785762

    The “exact” value (i.e., very precise approximation to this integral obtained using another numerical method) is \(0.676763\).

  3. We integrate \(g(x) = \exp(-x^3)\) against the exponential pdf \(f_X(x) = \exp(-x)\), for \(x>0\):

    nsim = 10000
    x = rexp(nsim)
    y = exp(-x^3)
    mean(y)
    ## [1] 0.5690694

    A close approximation of the true value is \(0.56889\).

  4. In this case, a possible choice of the distribution for \(X\) is the Cauchy distribution (no worries if you never heard about it), whose pdf is \(f_X(x) = \frac{1}{\pi(1+x^2)}\), so that \(g(x) = \pi \cos(x^2)\):

    nsim = 10000
    x = rcauchy(nsim)
    y = pi * cos(x^2)
    mean(y)
    ## [1] 1.299918

    The “exact” value is \(1.30561\).

The tricylinder is a solid body constructed as follows: create three cylinders of radius 1 around each of the three coordinate axes and intersect them:

Image by Ag2gaeh - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=63604565

Use Monte Carlo to estimate the volume of the tricylinder and check your estimate against the exact value \(8(2-\sqrt{2})\).

Read about the Monty Hall Problem online (the introduction to its Wikipedia page has a nice description), Use Monte Carlo to compare the two possible strategies (switching and not-switching) and decide which is better.

Click for Solution

The host knows where the car is and what contestant’s guess is. If those two are the same (i.e., contestant guessed right), he will choose one of the two remaining doors at random. If not, he simply shows the contestant the other door with the goat behind it. This exactly what the function show_door implements:

show_door = function(car, guess) {
    all_doors = c(1, 2, 3)
    goat_doors = all_doors[all_doors != car]

    if (car == guess) {
        random_goat_door = sample(goat_doors, size = 1)
        return(random_goat_door)
    } else {
        the_other_goat_door = goat_doors[goat_doors != guess]
        return(the_other_goat_door)
    }
}

Next, we write a function which simulates the outcome of a single game. It will have one argument, switch which will determine whether the contestant switches the door or not.

one_game = function(switch) {
    all_doors = c(1, 2, 3)
    car = sample(all_doors, size = 1)
    guess = sample(all_doors, size = 1)

    if (switch) {
        unguessed_doors = all_doors[all_doors != guess]
        shown_door = show_door(car, guess)
        switched_guess = unguessed_doors[unguessed_doors != shown_door]
        return(switched_guess == car)
    } else {
        return(guess == car)
    }
}

Finally we run two batches of \(10,000\) simulations, one with switch=TRUE and another with switch=FALSE:

nsim = 10000
switch_doors = replicate(nsim, one_game(TRUE))
dont_switch_doors = replicate(nsim, one_game(FALSE))
(prob_with_switching = mean(switch_doors))
## [1] 0.668
(prob_without_switching = mean(dont_switch_doors))
## [1] 0.3288

Therefore, the probability of winning after switching is about double the probability of winning without switching. Switching is good for you!

(A philosophical note: this was the most “agnostic” approach to this simulation. Simulations can often be simplified with a bit of insight. For example, we could have realized that the switching strategy simply flips the correctness of the guess (from “correct” to “wrong” and vice versa) and used it to write a much shorter answer. Ultimately, we could have realized that, because the probability of the initial guess being correct is \(1/3\), switching leads to a correct guess in \(2/3\) of the cases (and not switching in only \(1/3\) of the cases). In this case, the whole code would be sample(c("correct", "incorrect"), size=10000, prob= c(2/3,1/3), replacement=TRUE), which is an extremely inefficient way to estimate the value of the number \(2/3\)!)


  1. Find the beta distribution in R’s help system. The only thing you need to know about it is that it comes with two positive parameters \(\alpha\) and \(\beta\) (called shape1 and shape2 in R). Simulate \(n=10000\) draws from the beta distribution with parameters \(\alpha=0.5\) and \(\beta=0.3\). Plot a density histogram of your simulations as well as the pdf of the underlying distribution on the same plot. (Note: if you use the command lines instead of plot for the pdf, it will be automatically added to your previous plot - the histogram, in this case).

  2. Let \(X\) be a beta-distributed random variable with parameters \(\alpha=0.5\) and \(\beta=0.3\) (as above), and let \(Y\) be an independent random variable with the same distribution. Estimate the probability \({\mathbb{P}}[ XY > 1+ \log(Y)]\). Check graphically whether \(XY\) and \(1+\log(Y)\) are independent random variables.

  3. Let \(X\), \(Y\) and \(Z\) be three independent random variables uniformly distributed on \((0,1)\), and let \(M = \min(X,Y,Z)\). A theorem in probability states that \(M\) has a beta distribution with some parameters \(\alpha\) and \(\beta\). Use simulation (and associated plots) to make an educated guess about what the values of \(\alpha\) and \(\beta\) are. (Hint: they are nice round numbers.)


  1. Find the gamma distribution in R’s help system. The only thing you need to know about it is that it comes with two positive parameters \(k\) and \(\theta\) (called shape and scale in R). Simulate \(n=10000\) draws from the gamma distribution with parameters \(k=2\) and \(\theta=1\). Plot a density histogram of your simulations as well as the pdf of the underlying distribution on the same plot. Repeat for \(3\) more choices (left to you) of the parameters \(k\) and \(\theta\).

  2. Let \(X\) be a gamma-distributed random variable with parameters \(k=2\) and \(\theta=1\) (as above), and let \(Y_1, Y_2\) be a independent standard normals (also independent of \(X\)). Estimate the probability \({\mathbb P}[ X > Y_1^2 + Y_2^2 ]\) using Monte Carlo. Check graphically whether \(XY_1\) and \(X Y_2\) are independent.

  3. Let \(X\), \(Y\) and \(Z\) be three independent random variables with the gamma distribution and the scale parameter \(\theta=1\). Their shape parameters are, however different, and equal to \(1,2\) and \(3\), respectively. Use simulation (and associated plots) to make an educated guess about the distribution of \(X+Y+Z\) and its parameters. (Hint: the parameters will be nice round numbers.)

Click for Solution


  1. hist_with_pdf <- function(k, theta) {
      x = rgamma(10000, shape = k, scale = theta)
      hist(x, 
           probability = T,  
           breaks = 50, 
           xlim = c(0,15), ylim = c(0, 0.8),
           main = paste("shape = ", k, ", scale = ", theta)
           )
      x_pdf = seq(0, 15, by = 0.01)
      y_pdf = dgamma(x_pdf, shape = k, scale = theta)
      lines(x_pdf, y_pdf,
            type = "l",
            col = "red",
            lwd = 2)
    }
    
    par(mfrow = c(2, 2))  
    hist_with_pdf(2,1)
    hist_with_pdf(2,0.5)
    hist_with_pdf(3,1)
    hist_with_pdf(3,0.5)

  2. We use Monte Carlo to estimate the probability:

    x = rgamma(10000, shape = 2, scale = 1)
    y1 = rnorm(10000)
    y2 = rnorm(10000)
    z = x > y1^2 + y2^2
    mean(z)
    ## [1] 0.5508

    and the permuation test to check for (in)dependence

    w = x * y1
    z = x * y2
    par(mfrow = c(2, 2))
    z_perm_1 = sample(z)
    z_perm_2 = sample(z)
    z_perm_3 = sample(z)
    plot(w, z)
    plot(w, z_perm_1)
    plot(w, z_perm_2)

    The common shape of plots 2, 3. and 4. is sufficiently different from the shape of plot 1. to conclude that \(XY_1\) and \(XY_2\) are most likely not independent. Note: if you were to compute the correlation between \(XY_1\) and \(XY_2\) you would get \(0\) - this is an example of uncorrelated random variables that are, nevertheless, not independent.

  3. First, we simulate \(10,000\) draws from the distribution of \(S = X + Y + Z\) and plot the histogram of obtained values:

    nsim = 10000
    x = rgamma(nsim, shape = 1, scale = 1)
    y = rgamma(nsim, shape = 2, scale = 1)
    z = rgamma(nsim, shape = 3, scale = 1)
    s = x + y + z
    hist(s, breaks = 100)

    The shape resembles the shape we obtained in 1. above, so we try the gamma distribution with various parameters. The function try_parameters below does exacly that - it superimposes the gamma pdf onto the histogram of s.

    try_parameters = function(k, theta) {
        hist(s, probability = T, breaks = 50, main = paste("Trying shape = ", k, " scale = ",
            theta))
        x_pdf = seq(0, 15, by = 0.01)
        y_pdf = dgamma(x_pdf, k, theta)
        lines(x_pdf, y_pdf, type = "l", col = "red", lwd = 2)
    }
    
    par(mfrow = c(3, 3))
    try_parameters(1, 1)
    try_parameters(2, 1)
    try_parameters(1, 2)
    try_parameters(1, 3)
    try_parameters(2, 1)
    try_parameters(4, 1)

    We conclude that \(X+Y+Z\) is most likely gamma-distributed with \(k=6\) and \(\theta=1\). This is indeed, the case. The sum of independen gammas with shape parameters \(k_1, \dots, k_n\) and the same scale parameter \(\theta\) is a gamma with parameters \(k = k_1+\dots+k_n\) and \(\theta\).

We learned how to simulate from a joint distribution of two discrete vectors \((X,Y\)) by thinking of it as one-dimensional distribution but with values represented by pairs of numbers. Here is another way this can be done:

  1. Find the marginal distribution of one of them, say \(X\), and simulate from it

  2. Given the value you just obtained, let’s call it \(x\), simulate from the conditional distribution of \(Y\), given \(X=x\).

Implement this procedure on the following joint distribution:
1 2 3
1 0.1 0.0 0.3
2 0.1 0.1 0.0
3 0.0 0.0 0.4

Display the contingency table of your simulations, first using counts, and then using relative frequencies. Compare to the theoretical values (i.e., the probabilities in the table above).

Click for Solution


margin_X = c(0.2, 0.1, 0.7 )
cond_Y_X = matrix(
   c( 0.5, 0.0, 3/7,
      0.5, 1.0, 0.0,
      0.0, 0.0, 4/7),
   byrow=TRUE,
   nrow=3)

single_draw = function() {
   x = sample(c(1,2,3), size=1, prob=margin_X)
   y = sample(c(1,2,3), size=1, prob=cond_Y_X[,x])
   return(c(x,y))
}

nsim=10000
df = data.frame(
   t(replicate(nsim, single_draw()))
)
colnames(df) = c("x","y")
t(table(df))
##    x
## y      1    2    3
##   1  962    0 2963
##   2 1012 1023    0
##   3    0    0 4040

The variables margin_X and cond_X_Y are what you get when you compute the marginal and the conditional distribution from the given joint-distribution table as you did in your probability class.

The function single_draw performs a single draw form the distribution of \((X,Y)\) by first drawing the value of \(X\) from its marginal distribution. Then, it chooses the row of the conditional distribution table according to the obtained value of \(X\) and simulates from it.

The function replicate is used to repeat single_draw many times and collect the results. By default, replicate attaches the output of each new “replication” as a new column and not a row, so we need to transpose the final product. That is what the function t() is for. We turn the result into a data frame because the function table knows how to handle data frames automatically. Another use of the transpose gives x the horizontal axis, and y the vertical one, like in the statement of the problem.

Let \((Z,W)\) is be a discrete random vector with the following joint distribution table (different rows correspond to different values of \(Z\)):

-1 0 1
-1 0.2 0.1 0.0
0 0.1 0.2 0.1
1 0.0 0.1 0.2

and set \[ X = Z+W \text{ and } Y=Z W.\]

  1. Find the distribution table for \((X,Y)\) and compute \({\mathbb{E}}[ X Y]\) and \({\mathbb{P}}[X=0 | Y=0]\) analytically (no simulations!). You can do this part on a separate piece of paper or inside your Rmd file if you know LaTeX.

  2. Draw \(10000\) simulations of \((X,Y)\) (don’t print them out). Display the contingency table of your simulations using relative frequencies, as well as the table of errors (differences between your contingency table and the table of probabilities).

  3. Compute \({\mathbb{E}}[X Y]\) again, but this time using your simulations from 1. above.

  4. Draw simulations from the conditional distribution of \(X\), given \(Y=0\). Use them to estimate the conditional probability \({\mathbb{P}}[ X=0 | Y=0]\). By how much does it differ from your analytical result from 1. above.

Harry Potter’s cousin Nigel Potter owns a magical set of three dice. They behave just like any three (fair, independent) 6-sided dice, except for the fact that, when thrown together, they always show three distinct numbers. In other words, the outcomes with repeating numbers never happen, while all combinations of three distinct numbers are equally likely.

  1. Create a large (\(\ge 10,000\)) set of simulations of throws of these magical dice. Use any method you like.

  2. Output the contingency table for the outcomes of the first two dice, and compare to theoretical probabilities.

  3. Draw samples from the conditional distribution of the outcome of the third die, given that the sum on the first two is \(6\). Display the contingency table for your outcomes.

  4. (Extra credit) Find the (theoretical) conditional distribution of the third die given that the sum on the first two is \(6\), and compare it to your result from 3. above.

Click for Solution


  1. Here are three different methods you can use to do this problem:

    Sampling without replacement:

    The three magical dice sample without replacement from the set \(\{1,2,3,4,5,6\}\), so we can simply do the following:

    nsim = 10000
    sims = data.frame(t(replicate(nsim, sample(c(1, 2, 3, 4, 5, 6), size = 3, replace = FALSE))))
    colnames(sims) = c("x", "y", "z")

    The rejection method:

    We simulate three regular dice, and then simply reject all outcomes where two of the numbers are the same:

    nsim = 25000
    x0 = sample(c(1, 2, 3, 4, 5, 6), size = nsim, replace = TRUE)
    y0 = sample(c(1, 2, 3, 4, 5, 6), size = nsim, replace = TRUE)
    z0 = sample(c(1, 2, 3, 4, 5, 6), size = nsim, replace = TRUE)
    
    good = (x0 != y0) & (y0 != z0) & (x0 != z0)
    
    sims = data.frame(x = x0[good], y = y0[good], z = z0[good])
    nsim = dim(sims)[1]  # we rejected a random number of draws

    Using sample:

    We need to make the list of all allowed combinations (no repeats) and then sample the entire triplet of dice at once. I am using rbind to append a row to a data frame, but you can do this in many other ways.

    nsim = 100000
    df = data.frame()
    for (i in 1:6) {
        for (j in 1:6) {
            for (k in 1:6) {
                if ((i != j) & (j != k) & (i != k)) {
                    df = rbind(df, c(i, j, k))
                }
            }
        }
    }
    colnames(df) = c("x", "y", "z")
    
    rows = as.numeric(row.names(df))  # as.numeric needed to turn strings into integers
    sampled_rows = sample(rows, nsim, replace = TRUE)
    
    sims = df[sampled_rows, ]
  2. The probability of seeing a pair \((i,j)\) is \(0\) if \(i=j\) and \(1/30\) otherwise (where \(30\) is the number of pairs of different numbers).

    table_sim = table(sims$x, sims$y)/nsim
    table_th = matrix(1/30, ncol = 6, nrow = 6)
    for (i in 1:6) table_th[i, i] = 0
    (error = table_sim - table_th)
    ##    
    ##                   1               2               3               4
    ##   1  0.000000000000  0.000576666667  0.000196666667  0.000236666667
    ##   2  0.000096666667  0.000000000000 -0.000633333333  0.001136666667
    ##   3 -0.001023333333  0.000046666667  0.000000000000 -0.000333333333
    ##   4  0.000046666667  0.000956666667 -0.000513333333  0.000000000000
    ##   5 -0.000173333333  0.000066666667 -0.000563333333  0.000476666667
    ##   6 -0.000283333333  0.001056666667 -0.000953333333  0.000246666667
    ##    
    ##                   5               6
    ##   1 -0.000243333333  0.000846666667
    ##   2 -0.000763333333  0.000006666667
    ##   3 -0.000553333333  0.000486666667
    ##   4  0.000296666667 -0.000713333333
    ##   5  0.000000000000  0.000136666667
    ##   6 -0.000163333333  0.000000000000
  3. good = (sims$x + sims$y == 6)
    z_cond = sims$z[good]
    (table_sim = table(z_cond)/length(z_cond))
    ## z_cond
    ##         1         2         3         4         5         6 
    ## 0.1305829 0.1236205 0.2494630 0.1229539 0.1245093 0.2488705
  4. We can reuse the code from above (simulation using sample) to list all possible elementary outcomes with \(x+y=6\) and then count the frequencies of different values of the third die in this list (or we could simply do that on a piece of paper):

    df = data.frame()
    for (i in 1:6) {
        for (j in 1:6) {
            for (k in 1:6) {
                if ((i != j) & (j != k) & (i != k) & (i + j == 6)) {
                    df = rbind(df, c(i, j, k))
                }
            }
        }
    }
    colnames(df) = c("x", "y", "z")
    (table_th = table(df$z)/length(df$z))
    ## 
    ##     1     2     3     4     5     6 
    ## 0.125 0.125 0.250 0.125 0.125 0.250

    The error is given by:

    options(digits = 2)
    (error = table_sim - table_th)
    ## z_cond
    ##        1        2        3        4        5        6 
    ##  0.00558 -0.00138 -0.00054 -0.00205 -0.00049 -0.00113

Exactly one percent of the people in a given population have a certain disease. The accuracy of the diagnostic test for it is such that it detects the sick as sick with probability \(0.95\) and the healthy as healthy with probability \(0.9\). A person chosen at random from the population tested positive. What is the probability the he/she is, in fact, sick. Do the problem both analytically and by Monte Carlo.

Click for Solution
This is a classical example of the unexpected conclusions we sometimes get from the Bayes formula. As usual, we depict the situation using a tree diagram:

The person can test positive (denoted by \(tS\) in the plot) in two ways. By actually being sick (\(S\)) and then testing positive, or by being healthy and then testing positive. Bayes formula (or simply a look at the picture above) gives us \[ {\mathbb{P}}[ S | tS ] = \frac{ 0.01 \times 0.95}{ 0.01\times 0.95 + 0.1\times 0.99 } \approx 0.088.\] Thus, even when the test is quite accurate, the probability of getting a false positive is very high.

Let us do the same via Monte Carlo. We proceed like this. First we “pick a person” from the population by sampling from c("H", "S") and then “test” this person. After repeating this nsim times, we condition on the positive test, by removing all draws where the test was negative. This leaves us with a population of people who tested positive, and we simply need to see what proportion of those were are actually sick.

single_draw = function() {
    x = sample(c("H", "S"), size = 1, prob = c(0.99, 0.01))
    if (x == "H") {
        y = sample(c("tH", "tS"), size = 1, prob = c(0.9, 0.1))
    } else {
        y = sample(c("tH", "tS"), size = 1, prob = c(0.05, 0.95))
    }
    return(c(x, y))
}

nsim = 100000
df = data.frame(t(replicate(nsim, single_draw())))
colnames(df) = c("status", "test_result")

cond = (df$test_result == "tS")
df_cond = df[cond, ]
(prob = mean(df_cond$status == "S"))
## [1] 0.089

A point is chosen at random, uniformly in the unit cube \([0,1]\times [0,1]\times [0,1]\). Its distance to the origin \((0,0,0)\) is measured, and turns out to be equal to \(1.5\).

Use simulations to estimate the shape of the pdf of the conditional distribution of the point’s distance to \((1,1,1)\). Compare it to the unconditional case, i.e., the case where no information about the distance to \((0,0,0)\) is known.

Compute the mean of this (conditional) distribution for a few values of the parameter \(\varepsilon\) you use to deal with conditioning in the continuous case. Make sure you include values of \(\varepsilon\) on both sides of the spectrum - too big, and too small.

Click for Solution

We want to vary the parameter eps later, so let’s write a function first:

simulate = function(nsim, eps, conditional) {
    x = runif(nsim)
    y = runif(nsim)
    z = runif(nsim)
    d1 = sqrt((1 - x)^2 + (1 - y)^2 + (1 - z)^2)
    if (conditional) {
        d0 = sqrt(x^2 + y^2 + z^2)
        cond = (d0 > 1.5 - eps) & (d0 < 1.5 + eps)
        return(d1[cond])
    } else {
        return(d1)
    }
}

Histograms may be used as approximations to the pdf of the (conditional) distribution:

nsim = 1000000
eps = 0.1
d1_cond = simulate(nsim, eps, conditional = TRUE)
d1 = simulate(nsim, eps, conditional = FALSE)
par(mfrow = c(1, 2))
hist(d1, breaks = 50)
hist(d1_cond, breaks = 50)

Note that, in addition to clearly different shapes, the supports of the two distributions differ, too. Unconditionally, the distance to \((1,1,1)\) can be any number from \(0\) to \(\sqrt{3} \approx 1.73\). If it is known that the distance to \((0,0,0)\) is \(1.5\), however, the distance to \((1,1,1)\) cannot be larger than \(1\).

Finally, let us compare the results we obtain by varying the parameter \(\varepsilon\), first with nsim=100000:

nsim = 100000
epss = c(2, 1, 0.5, 0.3, 0.2, 0.1, 0.02, 0.01, 0.001, 0.0001)
d1s = vector(length = length(epss))
for (eps in epss) {
    sims = simulate(nsim, eps = eps, conditional = TRUE)
    print(paste("Eps = ", eps, ", Draws = ", length(sims), " Mean = ", mean(sims)))
}
## [1] "Eps =  2 , Draws =  100000  Mean =  0.960342228444094"
## [1] "Eps =  1 , Draws =  93544  Mean =  0.929726634933677"
## [1] "Eps =  0.5 , Draws =  47806  Mean =  0.753283074403369"
## [1] "Eps =  0.3 , Draws =  20224  Mean =  0.595529938750379"
## [1] "Eps =  0.2 , Draws =  10658  Mean =  0.496362640888308"
## [1] "Eps =  0.1 , Draws =  3930  Mean =  0.36209037346251"
## [1] "Eps =  0.02 , Draws =  647  Mean =  0.310139851430518"
## [1] "Eps =  0.01 , Draws =  354  Mean =  0.30926108659397"
## [1] "Eps =  0.001 , Draws =  33  Mean =  0.334224126737702"
## [1] "Eps =  0.0001 , Draws =  4  Mean =  0.275311021507976"

The same experiment, but with nsim=1000000 yields:

nsim = 1000000
epss = c(2, 1, 0.5, 0.3, 0.2, 0.1, 0.02, 0.01, 0.001, 0.0001)
d1s = vector(length = length(epss))
for (eps in epss) {
    sims = simulate(nsim, eps = eps, conditional = TRUE)
    print(paste("Eps = ", eps, ", Draws = ", length(sims), " Mean = ", mean(sims)))
}
## [1] "Eps =  2 , Draws =  1000000  Mean =  0.960506260771782"
## [1] "Eps =  1 , Draws =  934686  Mean =  0.928432784386909"
## [1] "Eps =  0.5 , Draws =  477123  Mean =  0.753640516540863"
## [1] "Eps =  0.3 , Draws =  201634  Mean =  0.596081456550207"
## [1] "Eps =  0.2 , Draws =  103279  Mean =  0.494799342061533"
## [1] "Eps =  0.1 , Draws =  38529  Mean =  0.361335849359528"
## [1] "Eps =  0.02 , Draws =  6944  Mean =  0.3064588479179"
## [1] "Eps =  0.01 , Draws =  3371  Mean =  0.30560770769072"
## [1] "Eps =  0.001 , Draws =  331  Mean =  0.301734543251831"
## [1] "Eps =  0.0001 , Draws =  33  Mean =  0.31500400469211"


  1. Simulate \(10000\) draws from a uniform distribution inside the cube \([-1,1]\times[-1,1] \times [-1,1]\). Go through your simulations, and discard the ones that do not lie within the unit ball (the ball centered around \((0,0,0)\), with radius \(1\).) Now you have a bunch of uniform simulations from the unit ball (Do not display them).

  2. Use your simulations to estimate the mean and the standard deviation of \(W\), where \(W\) is the distance from the the origin to a randomly and uniformly chosen point in the unit ball.

  3. A point was chosen randomly and uniformly in the unit ball, and it has been observed that it is closer to the point \((1,-1,1)\) than to \((1,-1,-1)\). Estimate (graphically) the shape of the conditional pdf of its distance to the origin, given this observation.

  4. (extra credit) Do 3., but with the point chosen uniformly over the surface of the ball and for the distance to the point \((1,1,1)\) and not to the origin (which is always \(1\) in this case). Even more extra points if you do not draw any additional simulations.

Click for Solution


  1. We draw the \(x\), \(y\) and \(z\) coordinates independet of each other, and uniformly in \([-1,1]\). Then we discard those simulatione where the sum of the squares is \(>1\):

    nsim = 10000
    x_cube = runif(nsim, min = -1, max = 1)
    y_cube = runif(nsim, min = -1, max = 1)
    z_cube = runif(nsim, min = -1, max = 1)
    
    in_ball = x_cube^2 + y_cube^2 + z_cube^2 <= 1
    
    x = x_cube[in_ball]
    y = y_cube[in_ball]
    z = z_cube[in_ball]
  2. By Monte Carlo

    W = sqrt(x^2 + y^2 + z^2)
    mean(W)
    ## [1] 0.75
    sd(W)
    ## [1] 0.19

    Note that the mean is not half the radius, i.e. \(0.5\). Why?

  3. We add two new variables, \(d1\) and \(d2\) - distances to \((1,-1,1)\) and \((1,-1,-1)\), respectively:

    d1 = sqrt((x - 1)^2 + (y + 1)^2 + (z - 1)^2)
    d2 = sqrt((x - 1)^2 + (y + 1)^2 + (z + 1)^2)
    condition = d1 < d2
    W_cond = W[condition]
    hist(W_cond, breaks = 50, probability = T)

  4. The main difficulty in this part of the problem is getting simulations from the uniform distributions on the sphere. One possibility is to use the rejection method, and proceed as in part 1., but keep only those points for which \[1-\varepsilon < \sqrt{x^2+y^2+z^2} < 1+\varepsilon\] for some (small) \(\varepsilon\). There is a more efficient method, though, and it does not involve any new simulations. We simply take our uniformly distributed points inside the unit ball and project each one onto the sphere, i.e. replace it by the closest point on the surface. This is easily achieved since the projection of the point \((x,y,z)\) is \[\Big( \frac{x}{\sqrt{x^2+y^2+z^2}}, \frac{y}{\sqrt{x^2+y^2+z^2}}, \frac{y}{\sqrt{x^2+y^2+z^2}}, \Big).\]

    A moment’s though will convince you that those points indeed cover the sphere uniformly (no direction is preferred!).

    Since we already have the distance to the origin stored in the variable W, the code will be extremely simple:

    x_s = x/W
    y_s = y/W
    z_s = z/W

    The rest now parallels what happened in 2.

    d1 = sqrt((x_s - 1)^2 + (y_s + 1)^2 + (z_s - 1)^2)
    d2 = sqrt((x_s - 1)^2 + (y_s + 1)^2 + (z_s + 1)^2)
    condition = d1 < d2
    
    W1 = sqrt((x - 1)^2 + (y - 1)^2 + (z - 1)^2)
    W1_cond = W1[condition]
    hist(W1_cond, breaks = 50, probability = T)

Endnotes

Random Walks

What are stochastic processes?

A stochastic process is a sequence - finite or infinite - of random variables. We usually write \(\{X_n\}_{n\in{\mathbb{N}}_0}\) or \(\{X_n\}_{0\leq n \leq T}\), depending on whether we are talking about an infinite or a finite sequence. The number \(T\in {\mathbb{N}}_0\) is called the time horizon, and we sometimes set \(T=+\infty\) when the sequence is infinite. The index \(n\) is often interpreted as time, so that a stochastic process can be thought of as a model of a random process evolving in time. The initial value of the index \(n\) is often normalized to \(0\), even though other values may be used. This it usually very clear from the context.

It is important that all the random variables \(X_0, X_1,\dots\) “live” on the same sample space \(\Omega\). This way, we can talk about the notion of a trajectory or a sample path of a stochastic process: it is, simply, the sequence of numbers \[X_0(\omega), X_1(\omega), \dots\] but with \(\omega\in \Omega\) considered “fixed”. In other words, we can think of a stochastic process as a random variable whose value is not a number, but sequence of numbers. This will become much clearer once we introduce enough examples.

The Simple Symmetric Random Walk

A stochastic process \(\{X\}_{n\in{\mathbb{N}}_0}\) is said to be a simple symmetric random walk (SSRW) if

  1. \(X_0=0\),

  2. the random variables \(\delta_1 = X_1-X_0\), \(\delta_2 = X_2 - X_1\), …, called the steps of the random walk, are independent

  3. each \(\delta_n\) has a coin-toss distribution, i.e., its distribution is given by \[{\mathbb{P}}[ \delta_n = 1] = {\mathbb{P}}[ \delta_n=-1] = \tfrac{1}{2} \text{ for each } n.\]

Some comments:

  • This definition captures the main features of an idealized notion of a particle that gets shoved, randomly, in one of two possible directions, over and over. In other words, these “shoves” force the particle to take a step, and steps are modeled by the random variables variables \(\delta_1,\delta_2, \dots\). The position of the particle after \(n\) steps is \(X_n\); indeed, \[X_n = \delta_1 + \delta_2 + \dots + \delta_n \text{ for }n\in {\mathbb{N}}.\] It is important to assume that any two steps are independent of each other - the most important properties of random walks depend on this in a critical way.

  • Sometimes, we only need a finite number of steps of a random walk, so we only care about the random variables \(X_0, X_1,\dots, X_T\). This stochastic process (now with a finite time horizon \(T\)) will also be called a random walk. If we want to stress that the horizon is not infinite, we sometimes call it the finite-horizon random walk. Whether \(T\) is finite or infinite is usually clear from the context.

  • The starting point \(X_0=0\) is just a normalization. Sometimes we need more flexibility and allow our process to start at \(X_0=x\) for some \(x\in {\mathbb{N}}\). To stress that fact, we talk about the random walk starting at \(x\). If no starting point is mentioned, you should assume \(X_0=0\).

  • We will talk about biased (or asymmetric) random walks a bit later. The only difference will be that the probabilities of each \(\delta_n\) taking values \(1\) or \(-1\) will be \(p\in (0,1)\) and \(1-p\), and not necessarily \(\tfrac{1}{2}\), The probability \(p\) cannot change from step to step and the steps \(\delta_1, \delta_2, \dots\) will continue to be independent from each other.

  • The word simple in its name refers to the fact that distribution of every step is a coin toss. You can easily imagine a more complicated mechanism that would govern each step. For example, not only the direction, but also the size of the step could be random. In fact, any distribution you can think of can be used as a step distribution of a random walk. Unfortunately, we will have very little to say about such, general, random walks in these notes.

How to simulate random walks

In addition to being quite simple conceptually, random walks are also easy to simulate. The fact that the steps \(\delta_n = X_n - X_{n-1}\) are independent coin tosses immediately suggests a feasible strategy: simulate \(T\) independent coin tosses first, and then define each \(X_n\) as the sum of the first \(n\) tosses.

Before we implement this idea in R, let us agree on a few conventions which we will use whenever we simulate a stochastic process:

  • the result of each simulation is a data.frame object
  • its columns will be the random variables \(X_0\), \(X_1\), It is a good idea to name your columns X0, X1, X2, etc.
  • each row will represent one “draw”:

This is best achieved by the following two-stage approach in R:

  1. write a function which will simulate a single trajectory of your process, If your process comes with parameters, it is a good idea to include them as arguments to this function.

  2. use the function replicate to stack together many such simulations and convert the result to a data.frame. Don’t forget to transpose after (use the function t) because replicate works column by column, and not row by row.

Let’s implement this in the case of a simple random walk. Of course, it is impossible to simulate a random walk on an infinite horizon (\(T=\infty\)) so we must restrict to finite-horizon random walks6. The function cumsum which produces partial sums of its input comes in very handy.

single_trajectory = function(T, p = 0.5) {
    delta = sample(c(-1, 1), size = T, replace = TRUE, prob = c(1 - p, p))
    x = cumsum(delta)
    return(x)
}

Next, we run the same function nsim times and record the results. It is a lucky break that the default names given to columns are X1, X2, … so we don’t have to rename them. We do have to add the zero-th column \(X_0=0\) because, formally speaking, the “random variable” \(X_0=0\) is a part of the stochastic process. This needs to be done before other columns are added to maintain the proper order of columns, which is important when you want to plot trajectories.

simulate_walk = function(nsim, T, p = 0.5) {
  return(
    data.frame(
      X0 = 0,
      t(replicate(nsim, single_trajectory(T, p)))
    ))
}
walk = simulate_walk(nsim = 10000, T = 500)

Two ways of looking at a stochastic proceses

Now that we have the data frame walk, we can explore in at least two qualitatively different ways:

Column-wise (distributionally)

Here we focus on individual random variables (column) or pairs, triplets, etc. of random variables and study their (joint) distributions. For example, we can plot histograms of the random variables \(X_5, X_8, X_{30}\) or \(X_{500}\):

We can also use various (graphical or not) devices to understand joint distributions of pairs of random variables:

Row-wise (trajectorially or path-wise)

If we focus on what is going on in a given row of walk, we are going to see a different cross-section of our stochastic process. This way we are fixing the state of the world \(\omega\) (represented by a row of walk), i.e., the particular realization of our process, by varying the time parameter. A typical picture associated to a trajectory of a random walk is the following

You can try to combine the two approaches (if you must) and plot several trajectories on the same plot. While this produces pretty pictures (and has one or two genuine applications), it usually leads to a sensory overload. Note that the trajectories on the righr are jittered a bit. That means that the positions of the points are randomly shifted by a small amount. This allows us to see features of the plot that would otherwise be hidden because of the overlap.

The path space

The row-wise (or path-wise or trajectory-wise) view of the random walk described above illustrates a very important point: the random walk (and random processes in general) can be seen as random “variable” whose values are not merely numbers; they are sequences of numbers (trajectories). In other words, a random process is simply a “random trajectory”. We can simulate this random trajectory as we did above, but simulating the steps and adding them up, but we could also take a different approach. We could build the set of all possible trajectories, and then pick a random trajectory out of it.

For a random walk on a finite horizon \(T\), a trajectory is simply a sequence of natural numbers starting from \(0\). Different realizations of the coin-tosses \(\delta_n\) will lead to different trajectories, but not every sequence of natural numbers corresponds to a trajectory. For example \((0,3,4,5)\) is not possible because the increments of the random walk can only take values \(1\) or \(-1\). In fact, a finite sequence \((x_0, x_1, \dots, x_T)\) is a (possible) sample path of a random walk if and only if \(x_0=0\) and \(x_{k}-x_{k-1} \in \{-1,1\}\) for each \(k\). For example, when \(T=3\), there are \(8\) possible trajectories: \[ \begin{align} \Omega = \{ &(0,1,2,3), (0,1,2,1),(0,1,0,2), (0,1,0,-1), \\ & (0,-1,-2,-3), (0,-1,-2,-1), (0,-1,0,-2), (0,-1,0,1)\} \end{align}\] When you (mentally) picture them, think of their graphs:

Each trajectory corresponds to a particular combination of the values of the increments \((\delta_1,\dots, \delta_T)\), each such combination happens with probability \(2^{-T}\). This means that any two trajectories are equally likely. That is convenient, because this puts uniform probability on the collection of trajectories. We are now ready to implement our simulation procedure in R; let us write the function single_trajectory using this approach and use it to simulate a few trajectories. We assume that a function all_paths(T) which returns a list of all possible paths with horizon \(T\) has already been implemented (more info about a possible implementation in R is given in a problem below):

T=5
Omega = all_paths(T)
single_trajectory = function() {
  return(unlist(sample(size=1,Omega)))
}

simulate_walk = function(nsim, p=0.5) {
   return(data.frame(
            X0=0, 
            t(replicate(nsim, single_trajectory()))
        ))
}

The distribution of \(X_n\)

Building a path space is not simply an exercise in abstraction. Here is how we can use is to understand the distribution of the position of the random walk:

Let \(X\) be a simple symmetric random walk with time horizon \(T=5\). What is the probability that \(X_{5}=1\)?

Let \(\Omega\) be the path space, i.e., the set of all possible trajectories of length \(5\) - there are \(2^{5}=32\) of them. The probability that \(X_{5}=1\) is the probability that a randomly picked path from \(\Omega\) will take the value \(1\) at \(n=5\). Since all paths are equally likely, we need to count the number of paths with value \(1\) at \(n=5\) and then divide by the total number of paths, i.e., \(32\).

So, how many paths are there that take value \(1\) at \(n=5\)? Each path is built out of steps of absolute value \(1\). Some of them go up (call them up-steps) and some of them go down (down-steps). A moment’s though reveals that the only way to reach \(1\) in \(5\) steps is if you have exactly \(3\) up-steps and \(2\) down-steps. Conversely, any path that has \(3\) up-steps and \(2\) down-steps ends at \(1\).

This realization transforms the problem into the following: how many paths are there with exactly \(3\) up-steps (note that we don’t have to specify that there are \(2\) down-steps - it will happen automatically). The only difference between different paths with exactly \(3\) up-steps is the position of these up-steps. In some of them the up-steps happen right at the start, in some at the very end, and in some they are scattered around. Each path with \(3\) up-steps is uniquely determined by the list of positions of those up-steps, i.e., with a size-\(3\) subset of \(\{1,2,3,4,5\}\). This is not a surprise at all, since each path is built out of increments, and positions of positive increments clearly determine values of all increments.

The problem has now become purely mathematical: how many size-\(3\) subsets of \(\{1,2,3,4,5\}\) are there? The answer comes in the form of a binomial coefficient \(\binom{5}{3}\) whose value is \(10\) - there are exactly ten ways to pick three positions out of five. Therefore, \[ {\mathbb{P}}[ X_{5} = 1] = 10 \times 2^{-5} = \frac{5}{16}.\]

Can we do this in general?

Let \(X\) be a simple symmetric random walk with time horizon \(T\). What is the probability that \(X_{n}=k\)?

The reasoning from the last example still applies. A trajectory with \(u\) up-steps and \(d\) down-steps will end at \(u-d\), so we must have \(u-d=k\). On the other hand \(u+d=n\) since all steps that are not up-steps are necessarily down-steps. This gives as a simple linear system with two equations and two unknowns which solves to \(u = (n+k)/2\), \(d=(n-k)/2\). Note the \(n\) and \(k\) must have the same parity for this solution to be meaningful. Also, \(k\) must be between \(-n\) and \(n\).

Having figured out how many up-steps is necessary to reach \(k\), all we need to do is count the number of trajectories with that many up-steps. Like before, we can do that by counting the number of ways we can choose their position among \(n\) steps, and, like before, the answer is the binomial coefficient \(\binom{n}{u}\) where \(u=(n+k)/2\). Dividing by the total number of trajectories gives us the final answer: \[ {\mathbb{P}}[ X_n = k ] = \binom{n}{ (n+k)/2} 2^{-n},\] for all \(k\) between \(-n\) and \(n\) with same parity as \(n\). For all other \(k\), the probability is \(0\).

The binomial coefficient and the \(n\)-th power suggest that the distribution of \(X_n\) might have something to do with the binomial distribution. It is clearly not the binomial, since it can take negative values, but it is related. To figure out what is going on, let us first remember what the binomial distribution is all about. Formally, it is a discrete distribution with two parameters \(n\in{\mathbb{N}}\) and \(p\in (0,1)\). Its support is \(\{0,1,2,\dots, n\}\) and the distribution is given by the following table, where \(q=1-p\)

0 1 2 k n
\(\binom{n}{0} q^n\) \(\binom{n}{1} p q^{n-1}\) \(\binom{n}{2} p^2 q^{n-2}\) \(\binom{n}{k} p^k q^{n-k}\) \(\binom{n}{n} p^n\)

The binomial distribution is best understood, however, when it is expressed as a “number of successes”. More precisely,

If \(B_1,B_2,\dots, B_n\) are \(n\) independent Bernoulli random variables with the same parameter \(p\), then their sum \(B_1+\dots+B_n\) has a binomial distribution with parameters \(n\) and \(p\).

We think of \(B_1, \dots, B_n\) as indicator random variables of “successes” in \(n\) independent “experiments” each of which “succeeds” with probability \(p\). A canonical example is tossing a biased coin \(n\) times and counting the number of “heads”.

We know that the position \(X_n\) at time \(n\) of the random walk admits the representation \[ X_n = \delta_1+\delta_2+\dots+\delta_n,\] just like the binomial random variable. The distribution of \(\delta_k\) is not Bernoulli, though, since it takes the values \(-1\) and \(1\), and not \(0,1\). This is easily fixed by applying the linear transformation \(x\mapsto \frac{1}{2}(x+1)\); indeed \(( -1 +1)/2 = 0\) and \(( 1 + 1) / 2 =1\), and, so, \[ \frac{1}{2}(\delta_k+1)\text{ is a Bernoulli random variable with parameter } p=\frac{1}{2}.\] Consequently, if we add all \(B_k = \tfrac{1}{2}(1+\delta_k)\) and remember our discussion from above we get the following statement

In a simple symmetric random walk the random variable \(\frac{1}{2} (n + X_n)\) has the binomial distribution with parameters \(n\) and \(p=1/2\), for each \(n\).

Can you use that fact to rederive the distribution of \(X_n\)?

Biased random walks

If the steps of the random walk preferred one direction to the other, the definition would need to be tweaked a little bit and the word “symmetric” in the name gets replaced by “biased” (or “asymmetric”):

A stochastic process \(\{X\}_{n\in{\mathbb{N}}_0}\) is said to be a **simple biased random walk with parameter \(p\in (0,1)\) if

  1. \(X_0=0\),

  2. the random variables \(\delta_1 = X_1-X_0\), \(\delta_2 = X_2 - X_1\), …, called the steps of the random walk, are independent and

  3. each \(\delta_n\) has a biased coin-toss distribution, i.e., its distribution is given by \[{\mathbb{P}}[ \delta_n = 1] = p \text{ and } {\mathbb{P}}[ \delta_n=-1] = 1-p \text{ for each } n.\]

As far as the distribution of \(X_n\) is concerned, we don’t expect it to be the same as in the symmetric case. After all, the biased random walk (think \(p=0.999\)) will prefer one direction over the other. Our trick with writing \(\frac{1}{2}(n+X_n)\) as a sum of Bernoulli random variables still works. We just have to remember that \(p\) is not \(\frac{1}{2}\) anymore to conclude that \(\tfrac{1}{2}(X_n + n)\) has the binomial distribution with parameters \(n\) and \(p\); if we put \(u = (n+k)/2\) we get \[\begin{align} {\mathbb{P}}[ X_n = k] &= {\mathbb{P}}[ \tfrac{1}{2}(X_n+n) = u] = \binom{n}{u} p^u q^{n-u}\\ & = \binom{n}{\frac{1}{2}(n+k)} p^{\frac{1}{2}(n+k)} q^{\frac{1}{2}(n-k)}. \end{align}\] Note that be binomial coefficient stays the same as in the symmetric case, but the factor \(2^{-n} = (1/2)^{\frac{1}{2}(n+k)} (1/2)^{\frac{1}{2}(n-k)}\) becomes \(p^{\frac{1}{2}(n+k)} q^{\frac{1}{2}(n-k)}\).

Can we reuse the sample space \(\Omega\) to build a biased random walk? Yes, we can, but we need to assign possibly different probabilities to individuals. Indeed, if \(p=0.99\), the probability that all the increments \(\delta\) of a \(10\)-step random walk take the value \(+1\) is \((0.99)^{10} \approx 0.90\). This is much larger than the probability that all steps take the value \(-1\), which is \((0.01)^{10}= 10^{-20}\).

In general, the probability that a particular path is picked out of \(\Omega\) will depend on the number of up-steps and down-steps; more precisely it equals \(p^u q^{n-u}\) where \(u\) is the number of up-steps. The interesting thing is that the number of up-steps \(u\) depends only on the final position \(x_n\) of the path; indeed \(u = \frac{1}{2}(n+x_n)\). This way, all paths of length \(T=5\) that end up at \(1\) get the same probability of being chosen, namely \(p^3 q^2\). Let us use the awful seizure-inducing graph with multiple paths for good, and adjust the each path according to its probability; some jitter has been added to deal with overlap. The lighter-colored paths are less likely to happen then the darker-colored paths.

Additional problems for Chapter 3

Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple symmetric random walk. Which of the following processes are simple random walks?

  1. \(\{2 X_n\}_{n\in {\mathbb{N}}_0}\) ?

  2. \(\{X^2_n\}_{n\in {\mathbb{N}}_0}\) ?

  3. \(\{-X_n\}_{n\in {\mathbb{N}}_0}\) ?

  4. \(\{ Y_n\}_{n\in {\mathbb{N}}_0}\), where \(Y_n = X_{5+n}-X_5\) ?

How about the case \(p\ne \tfrac{1}{2}\)?

Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple random walk.

  1. Find the distribution of the product \(X_1 X_2\)

  2. Compute \({\mathbb{P}}[ |X_1 X_2 X_3|]=2\)

  3. Find the probability that \(X\) will hit neither the level \(2\) nor the level \(-2\) until (and including) time \(T=3\)

  4. Find the independent pairs of random variables among the following choices:

    • \(X_1\) and \(X_2\)
    • \(X_4 - X_2\) and \(X_3\)
    • \(X_4 - X_2\) and \(X_6 - X_5\)
    • \(X_1+X_3\) and \(X_2+X_4\).

Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple random walk.

  1. Compute \({\mathbb{P}}[ X_{32} = 4| X_8 = 6]\).

  2. Compute \({\mathbb{P}}[ X_9 = 3 \text{ and } X_{15}=5 ]\)

  3. (extra credit) Compute \({\mathbb{P}}[ X_7 + X_{12} = X_1 + X_{16}]\)

Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple random walk. For \(n\in{\mathbb{N}}\) compute the probability that \(X_{2n}\), \(X_{4n}\) and \(X_{6n}\) take the same value.

Write an R function (call it all_paths) which takes an integer argument T and returns a list of all possible paths of a random walk with time horizon \(T\). (Note: Since vectors cannot have other vectors as elements, you will need to use a data structure called list for this. It behaves very much like a vector, so it should not be a problem.)

Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple symmetric random walk. Given \(n\in{\mathbb{N}}_0\) and \(k\in{\mathbb{N}}\), compute \(\operatorname{Var}[X_n]\), \(\operatorname{Cov}[X_n, X_{n+k}]\) and \(\operatorname{corr}[X_n, X_{n+k}]\), where \(\operatorname{Cov}\) stands for the covariance and \(\operatorname{corr}\) for the correlation. (Note: look up \(\operatorname{Cov}\) and \(\operatorname{corr}\) if you forgot what they are).

Compute \(\lim_{n\to\infty} \operatorname{corr}[X_n, X_{n+k}]\) and \(\lim_{k\to\infty} \operatorname{corr}[X_n, X_{n+k}]\). How would you interpret the results you obtained?

Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple random walk with \({\mathbb{P}}[X_1=1]=p\in (0,1)\), and let \(A_n\) be the (signed) area under its graph (in the picture below, \(A_n\) is the area of the blue part minus the area of the orange part).

  1. Find a formula for \(A_n\) in terms of \(X_1,\dots, X_n\).

  2. Compute \({\mathbb{E}}[A_n]\) and \(\operatorname{Var}[A_n]\), for \(n\in{\mathbb{N}}\). (You will find the following formulas helpful \(\sum_{j=1}^n j = \frac{n(n+1)}{2}\) and \(\sum_{j=1}^n j^2=\frac{n(n+1)(2n+1)}{6}\).)

  3. Use simulations to approximate the entire distribution of \(A_n\) (set \(p=0.3\), \(n=10\) run \(10000\) simulations): display the entire distribution table. Next, compute Monte-Carlo approximations to \({\mathbb{E}}[A_{10}]\) and \(\operatorname{Var}[A_{10}]\) and compare them to the exact values obtained in 2. above.

Let \(\{X_n\}_{0 \leq n \leq 100}\) be the simple symmetric random walk with time horizon \(T = 100\). We define the following random variables

  • \(L\) is the time of the last zero, i.e., the last time, before or at \(T\), that \(X\) visited \(0\)
  • \(P\) is the amount of time spent above \(0\) i.e., the number of all times \(n\leq T\) with \(X_n>0\).
  • \(R\) is the last time time the maximum is attained, i.e., the last time before or at \(T\), that \(X\) took the value \(M_T\), where \(M\) is the running maximum process associated to \(X\).
  1. Draw nsim simulations of \(L\), \(P\) and \(R\) and display their histograms on the same graph, side by side. Set the number of bins (option breaks in the command hist) to \(50\). Choose the number nsim so that the simulations take no more than 2 minutes, but do not go over \(100,000\).

    (Hint: For each of the three random variables write a function which takes a single trajectory of a random walk as an argument and returns its value for that trajectory. \(P\) is easy, and for \(L\) and \(R\) use the function match. Be careful - match finds the first match, but you need the last one. Then simulate the random walk and apply your functions to the rows of your ouput data frame.)

  2. What do you observe? What conjecture can you make about distributions of \(L\), \(P\) and \(R\)?

  3. (Extra credit) Even though \(L\), \(P\) and \(R\) are discrete random variables, it turns out that distributions of their normalized versions \(L/T\), \(P/T\) and \(R/T\) are close to named continuous distributions. Guess what these distributions are and explain (graphically) why you think your guess is correct.

Endnotes


  1. it may interfere with your existing installation↩︎

  2. be careful, though. The expression x = y is not the same as x == y. It does not return a logical value - it assigns the value of y to x↩︎

  3. There are infinitely many ways random variables can be distributed. Indeed, in the discrete \({\mathbb N}\)-valued case only, any sequence of nonnegative numbers \((p_n)_n\) such that \(\sum_n p_n=1\) defines a probability distribution. It turns out, however, that a small-ish number of distributions appear in nature much more often then the rest. These distributions, like the normal, uniform, exponential, binomial, etc. turn out to be so important that they each get a name (hence named distributions). ↩︎

  4. Some books will define the geometric random variables as the number of tosses (and not Ts) before the first H is obtained. In that case, the final H is included into the count. Clearly, this definition and the one we have given differ by \(1\), and this is really not a big deal, but you have to be careful about what is meant when a geometric random variable is mentioned.↩︎

  5. The function sum adds up all the components of the vector. You would not want such a function to be vectorized. If it were, it would return exactly the same vector it got as input.↩︎

  6. It is somewhat unfortunate that the standard notation for the time horizon, namely \(T\), coincides with a shortcut T for TRUE in R. Our example still works fine because this shortcut is used only if there is no variable named T.↩︎